David Rodriguez david-rodriguez

david-rodriguez / scrapedata.py
Created November 18, 2024 14:21
This script processes PDF files, extracts text, and splits it into chunks for machine learning applications. It reads PDFs from an input directory, cleans and trims the text, chunks the text based on specified sizes, and outputs the results in JSON Lines format.
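The chunk-and-export step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's own implementation: the regex sentence splitter is a dependency-free stand-in for an NLTK tokenizer, and the names `split_sentences`, `chunk_text`, `to_jsonl`, and `max_chars`, as well as the JSON field names, are hypothetical.

```python
import json
import re


def split_sentences(text):
    """Naive regex sentence splitter (stand-in for an NLTK tokenizer)."""
    # Collapse runs of whitespace so line breaks from the PDF do not matter.
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return []
    # Split after ., ! or ? when followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text)


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def to_jsonl(chunks, source):
    """Serialize chunks as JSON Lines, one JSON object per line."""
    # Field names here are assumptions chosen for illustration.
    return "\n".join(
        json.dumps({"source": source, "chunk_id": i, "text": chunk})
        for i, chunk in enumerate(chunks)
    )
```

Calling `to_jsonl(chunk_text(page_text), "report.pdf")` would yield one line per chunk. Packing whole sentences keeps each chunk readable on its own, at the cost of chunk sizes that vary below the limit.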
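# Standard library
import argparse
import json
import os
import re
import sys

# Third-party
import fitz  # PyMuPDF
import nltk

# Fetch the Punkt sentence-tokenizer data; the relative path resolves against the current working directory.
nltk.download('punkt_tab', download_dir='../.venv/nltk_data', quiet=True)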