Releases · icaropires/pdf2dataset · GitHub

21 Jul 19:00

icaropires

Support extraction without OCR and small data volume

Changes

Ability to extract text through pdftotext (without OCR)
Improved interface for using pdf2dataset as a library rather than through the CLI
Reduced disk I/O: each document is read only once
Added a specific mode for small volumes of documents
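
To illustrate what extraction without OCR can look like, here is a minimal sketch that shells out to the Poppler `pdftotext` command-line tool. It is an assumption for illustration only: these notes do not say how pdf2dataset invokes pdftotext, and the helper names `build_command` and `extract_text` are hypothetical, not part of the pdf2dataset API.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(pdf_path: Path) -> List[str]:
    """Build the pdftotext invocation for one document (hypothetical helper)."""
    # "-layout" asks pdftotext to keep the physical page layout;
    # "-" as the output argument makes it write the text to stdout.
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def extract_text(pdf_path: Path) -> Optional[str]:
    """Return the embedded text of a PDF, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None  # a caller could fall back to OCR here
    result = subprocess.run(
        build_command(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

This only recovers text that is already embedded in the PDF. Scanned documents without a text layer would still need the OCR path.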


16 Jul 15:14

icaropires

Fix some warnings

Changes

Small improvements (remove some deprecated methods and the text of one warning)
Add some important tests
Improve README


15 Jul 23:38

icaropires

Initial Version

Features

Conversion of a whole subdirectory of PDF documents into a pandas DataFrame
Support of parallel and distributed computing through ray
Incremental writing of resulting DataFrame, to save memory
Ability to keep processing progress and resume from it
Error tracking of faulty documents
Use OCR for extracting text through pytesseract and pdf2image
Initial, easy-to-use CLI
Custom behaviour through parameters (number of CPUs, text language, etc.)
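
Three of the features above (incremental writing, resumable progress, and error tracking) can be sketched together using only the standard library. This is not pdf2dataset's implementation: the project writes a DataFrame, whereas this sketch swaps in a JSON Lines file for simplicity, and the names `load_done`, `process_dir`, and the `extract` callback are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Set


def load_done(results: Path) -> Set[str]:
    """Paths already recorded in the results file (successes and failures)."""
    if not results.exists():
        return set()
    with results.open(encoding="utf-8") as f:
        return {json.loads(line)["path"] for line in f if line.strip()}


def process_dir(root: Path, results: Path,
                extract: Callable[[Path], str]) -> int:
    """Process every PDF under root, skipping documents from earlier runs.

    Returns the number of documents handled in this run.
    """
    done = load_done(results)
    processed = 0
    with results.open("a", encoding="utf-8") as f:
        for pdf in sorted(root.rglob("*.pdf")):
            key = pdf.relative_to(root).as_posix()
            if key in done:
                continue  # resume: already handled in a previous run
            record = {"path": key, "text": None, "error": None}
            try:
                record["text"] = extract(pdf)
            except Exception as exc:
                # Error tracking: record the failure and keep going.
                record["error"] = repr(exc)
            # Incremental writing: one record per line, flushed immediately,
            # so results never have to be held in memory all at once.
            f.write(json.dumps(record) + "\n")
            f.flush()
            processed += 1
    return processed
```

Calling `process_dir` a second time with the same results file handles only documents that were not recorded before. Faulty documents stay in the file with their `error` field filled in, so they can be inspected later.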
