Releases: icaropires/pdf2dataset
Support extraction without OCR and small data volume
Changes
- Ability to extract text through pdftotext (without OCR)
- Improved interface for using pdf2dataset as a library rather than through the CLI
- Reduced disk I/O: each document is read only once
- Add specific mode for small volumes of documents
Fix some warnings
Changes
- Small improvements (remove some deprecated methods and update the text of one warning)
- Add some important tests
- Improve README
Initial Version
Features
- Conversion of a whole subdirectory of PDF documents into a pandas DataFrame
- Support of parallel and distributed computing through ray
- Incremental writing of resulting DataFrame, to save memory
- Ability to keep processing progress and resume from it
- Error tracking of faulty documents
- Use OCR for extracting text through pytesseract and pdf2image
- Initial, easy-to-use CLI
- Custom behaviour through parameters (number of CPUs, text language, etc.)
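
To illustrate how incremental writing, resumable progress, and error tracking can fit together, here is a minimal sketch using only the Python standard library. It is not pdf2dataset's implementation: the names `load_done`, `process_all`, and `FIELDS`, the CSV output format, and the `extract` callback are all assumptions made for this example (the features above describe a pandas DataFrame as the result).

```python
import csv
from pathlib import Path

# Hypothetical column layout for this sketch, not pdf2dataset's schema.
FIELDS = ["path", "text", "error"]


def load_done(results_csv):
    """Return the set of document paths already recorded in the results file."""
    if not results_csv.exists():
        return set()
    with results_csv.open(newline="", encoding="utf-8") as f:
        return {row["path"] for row in csv.DictReader(f)}


def process_all(paths, extract, results_csv):
    """Process each path not yet recorded, appending one row per document.

    `extract` is any callable that takes a path and returns its text.
    Returns the number of documents processed in this run.
    """
    done = load_done(results_csv)  # progress kept from earlier runs
    is_new = not results_csv.exists()
    processed = 0
    with results_csv.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        for path in paths:
            if path in done:
                continue  # resume: skip documents handled before
            try:
                row = {"path": path, "text": extract(path), "error": ""}
            except Exception as exc:
                # Error tracking: record the failure and keep going.
                row = {"path": path, "text": "", "error": repr(exc)}
            writer.writerow(row)
            f.flush()  # write incrementally instead of holding all rows in memory
            processed += 1
    return processed
```

In this sketch a document that failed is still recorded, so a later run skips it; retrying failures would require filtering on the `error` column when loading progress.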