Releases: icaropires/pdf2dataset

Support extraction without OCR and small data volume

21 Jul 19:00

Changes

  • Ability to extract text through pdftotext (without OCR)
  • Improved interface for using pdf2dataset as a library rather than through the CLI
  • Reduced disk I/O: each document is read only once
  • Add specific mode for small volumes of documents
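
To illustrate what extraction without OCR means, here is a minimal sketch that reads a PDF's embedded text layer. It assumes the Poppler command-line tool `pdftotext` is installed; the release note does not say whether pdf2dataset calls the CLI or a Python binding, so this is not the library's own code. The helper names `build_pdftotext_cmd` and `extract_text` are hypothetical.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None):
    """Build the argument list for Poppler's `pdftotext` CLI (hypothetical helper)."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # -f: first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # -l: last page to convert
    # "-" as the output file sends the extracted text to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path, first_page=None, last_page=None):
    """Return the text layer of a PDF; no OCR is involved (hypothetical helper)."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

This only works for PDFs that already contain a text layer; scanned documents yield little or no text and still need OCR.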

Fix some warnings

16 Jul 15:14

Changes

  • Small improvements (remove some deprecated methods and text of one warning)
  • Add some important tests
  • Improve README

Initial Version

15 Jul 23:38

Features

  • Conversion of a whole subdirectory of PDF documents into a pandas DataFrame
  • Support for parallel and distributed computing through Ray
  • Incremental writing of the resulting DataFrame, to save memory
  • Ability to keep processing progress and resume from it
  • Error tracking of faulty documents
  • Use OCR for extracting text through pytesseract and pdf2image
  • Initial, easy-to-use CLI
  • Custom behaviour through parameters (number of CPUs, text language, etc.)
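
The progress-resume and error-tracking features listed above can be pictured with a small standard-library sketch. It is an illustration under assumptions, not pdf2dataset's implementation: `process_directory`, the `extract` callable, and the JSON state file are all hypothetical, and a real pipeline would store results in a DataFrame-backed format instead.

```python
import json
from pathlib import Path


def process_directory(root, extract, state_file):
    """Run `extract` on every PDF under `root`, skipping files already recorded.

    `extract` is a hypothetical callable that takes a Path and returns text.
    """
    state_file = Path(state_file)
    # Load earlier progress, if any, so a rerun resumes instead of restarting.
    state = json.loads(state_file.read_text()) if state_file.exists() else {}

    for pdf in sorted(Path(root).rglob("*.pdf")):
        key = str(pdf)
        if key in state:
            continue  # handled (successfully or not) in a previous run
        try:
            state[key] = {"text": extract(pdf), "error": None}
        except Exception as exc:
            # Record the faulty document and continue with the rest.
            state[key] = {"text": None, "error": repr(exc)}
        # Persist after each document so an interruption loses at most one.
        state_file.write_text(json.dumps(state))
    return state
```

Writing the state after every document trades some disk writes for the ability to stop and restart at any point. In this sketch failed documents are also skipped on a rerun, so retrying them would mean deleting their entries first.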