Initial Version

@icaropires icaropires released this 15 Jul 23:38
· 61 commits to master since this release

Features

  • Conversion of a whole subdirectory of PDF documents into a pandas DataFrame
  • Support for parallel and distributed computing through ray
  • Incremental writing of the resulting DataFrame to save memory
  • Ability to keep processing progress and resume from it
  • Error tracking of faulty documents
  • OCR-based text extraction through pytesseract and pdf2image
  • Initial, easy-to-use CLI
  • Custom behaviour through parameters (number of CPUs, text language, etc.)
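
To illustrate how incremental writing, resumable progress, and error tracking can fit together, here is a minimal sketch. It is not this project's API: every name in it (`convert_directory`, `fake_extract`, the CSV columns) is hypothetical, and it uses only the Python standard library, with a thread pool standing in for ray and a CSV file standing in for the DataFrame output. The `fake_extract` function is a placeholder for the real OCR step.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fake_extract(pdf_path):
    """Hypothetical stand-in for the OCR step: returns the file's bytes as text."""
    data = Path(pdf_path).read_bytes()
    if not data:
        raise ValueError("empty document")
    return data.decode("utf-8", errors="replace")


def convert_directory(root, out_csv, extract=fake_extract, workers=2):
    """Append one row per PDF under root to out_csv; return how many were processed."""
    out_csv = Path(out_csv)

    # Resume: any path already present in the output file is skipped.
    done = set()
    if out_csv.exists():
        with out_csv.open(newline="") as f:
            done = {row["path"] for row in csv.DictReader(f)}

    pending = sorted(str(p) for p in Path(root).rglob("*.pdf") if str(p) not in done)

    def run(path):
        # Error tracking: a faulty document yields a row with the error recorded.
        try:
            return {"path": path, "text": extract(path), "error": ""}
        except Exception as exc:
            return {"path": path, "text": "", "error": repr(exc)}

    write_header = not out_csv.exists() or out_csv.stat().st_size == 0
    with out_csv.open("a", newline="") as f, ThreadPoolExecutor(max_workers=workers) as pool:
        writer = csv.DictWriter(f, fieldnames=["path", "text", "error"])
        if write_header:
            writer.writeheader()
        # Incremental writing: each row is flushed as soon as it is available,
        # so results are never accumulated in memory.
        for row in pool.map(run, pending):
            writer.writerow(row)
            f.flush()
    return len(pending)
```

Calling `convert_directory` a second time on the same directory and output file processes only documents that have no row yet. In this sketch, faulty documents also count as done and are not retried on resume; a real tool might choose differently.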