David Rodriguez david-rodriguez


david-rodriguez / scrapedata.py

Created November 18, 2024 14:21

This script processes PDF files, extracts text, and splits it into chunks for machine learning applications. It reads PDFs from an input directory, cleans and trims the text, chunks the text based on specified sizes, and outputs the results in JSON Lines format.
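As a rough illustration of the clean, chunk, and JSON Lines steps described above, here is a minimal standard-library sketch. It is not the gist's own implementation: the function names, the `max_chars` parameter, the record fields (`source`, `chunk_id`, `text`), and the sample file name `example.pdf` are all hypothetical, and a naive regex splitter stands in for NLTK's sentence tokenizer.

```python
import json
import re


def clean_text(raw):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", raw).strip()


def split_sentences(text):
    """Naive stand-in for a sentence tokenizer: split after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def chunk_sentences(sentences, max_chars):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def to_jsonl(chunks, source):
    """Serialize each chunk as one JSON object per line (JSON Lines)."""
    return "\n".join(
        json.dumps({"source": source, "chunk_id": i, "text": chunk})
        for i, chunk in enumerate(chunks)
    )


# Hypothetical input standing in for text extracted from a PDF page.
raw = "  First sentence.\n\nSecond   one!  Third? "
chunks = chunk_sentences(split_sentences(clean_text(raw)), max_chars=20)
jsonl = to_jsonl(chunks, source="example.pdf")
print(jsonl)
```

Packing whole sentences keeps each chunk readable on its own, at the cost of uneven chunk lengths. In the real script the raw text would come from the PDF library rather than a string literal.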

	import sys
	import os
	import fitz  # PyMuPDF
	import argparse
	import json
	import nltk
	import re

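	# Download the Punkt sentence tokenizer data; the relative path is resolved against the current working directory.
	nltk.download('punkt_tab', download_dir='../.venv/nltk_data', quiet=True)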