Releases: VikParuchuri/pdftext
Ignore special chars, break lines more aggressively
Merge pull request #12 from VikParuchuri/dev: Improve line breaks, ignore special chars
Fix flattening bug
Merge pull request #11 from VikParuchuri/dev: Fix bug with flattening
Fix document loading bug
- Fixed a bug where PDF paths were assumed to be strings, which is not always the case
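One common way to accept both string and path-like inputs is to normalize them up front. The sketch below only illustrates that pattern; `normalize_pdf_path` is a hypothetical helper and not necessarily how pdftext handles it.

```python
import os
from pathlib import Path


def normalize_pdf_path(pdf_path):
    """Hypothetical helper: accept a str or os.PathLike and return a str path."""
    # os.fspath returns a str unchanged and calls __fspath__ on path-like objects.
    return os.fspath(pdf_path)


# Both call styles yield a plain string path.
as_str = normalize_pdf_path("docs/sample.pdf")
as_path = normalize_pdf_path(Path("docs") / "sample.pdf")
```

Objects that are neither strings, bytes, nor path-like make `os.fspath` raise `TypeError`, so an invalid argument fails early.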
ONNX model, option to flatten form fields
- Faster inference with ONNX
- Remove warning when loading scikit-learn model
- Flatten form fields into pdf
Fix bbox bug
Fixed a bug where bboxes were not unnormalized properly.
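For readers unfamiliar with the term, unnormalizing a bbox means mapping relative coordinates back to page units. The sketch below assumes coordinates normalized to the range 0 to 1 and an `[x0, y0, x1, y1]` layout. Both are assumptions for illustration; the release note does not state the convention pdftext uses, and `unnormalize_bbox` is a hypothetical name.

```python
def unnormalize_bbox(bbox, page_width, page_height):
    """Hypothetical helper: scale a [x0, y0, x1, y1] bbox from 0..1 to page units."""
    x0, y0, x1, y1 = bbox
    # x coordinates scale with the page width, y coordinates with the page height.
    return [x0 * page_width, y0 * page_height, x1 * page_width, y1 * page_height]


# A box covering the middle half of the width and the bottom half of the height.
bbox = unnormalize_bbox([0.25, 0.5, 0.75, 1.0], page_width=600, page_height=800)
```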
Minor performance optimizations
- Optimize dictionary access and loops to get a ~10% speedup
Add optional parallel workers
Enable optional parallel workers when extracting text. This can cause a performance hit on small PDFs, but can speed things up 2x or more on larger ones. Enable it with the --workers flag via the CLI, or with the workers kwarg.
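The trade-off described above (pool overhead on small inputs, speedup on large ones) can be sketched with the standard library. This is a generic illustration of the pattern, not pdftext's implementation: `extract_page` and `extract_pages` are hypothetical names, and a thread pool is used only to keep the example simple, whereas CPU-bound extraction would more likely use processes.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page(page_number):
    """Hypothetical stand-in for per-page text extraction."""
    return f"text of page {page_number}"


def extract_pages(page_numbers, workers=None):
    """Extract pages serially by default, or in parallel when workers > 1."""
    # Without workers, skip the pool entirely to avoid its startup overhead.
    if not workers or workers <= 1:
        return [extract_page(n) for n in page_numbers]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input pages.
        return list(pool.map(extract_page, page_numbers))


serial = extract_pages(range(4))
parallel = extract_pages(range(4), workers=2)
```

Keeping the serial path as the default mirrors the note that parallelism is optional and can hurt on small documents.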
Fix font issue
Previously, not all spans had the right font information. This release fixes the issue.
Work around pdfium bug
- Charbox has zero width/height when loose=True with rotation
Fix font names
- Fix logic for pulling font names
- Increase sample frequency