Releases: VikParuchuri/pdftext

Ignore special chars, break lines more aggressively

17 Oct 18:51
7460bf4
Merge pull request #12 from VikParuchuri/dev

Improve line breaks, ignore special chars

Fix flattening bug

08 Oct 16:07
5915750
Merge pull request #11 from VikParuchuri/dev

Fix bug with flattening

Fix document loading bug

08 Oct 13:14
56af2c1
  • Fixed a bug where PDF paths were assumed to be strings, which is not always the case
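
The release note does not say which non-string path types triggered the bug. As a minimal sketch, assuming the caller may pass a `pathlib.Path` (or any `os.PathLike`), a loader can normalize its argument before opening the file. The helper name `normalize_pdf_path` is hypothetical and is not part of pdftext.

```python
import os
from pathlib import Path


def normalize_pdf_path(pdf_path):
    """Hypothetical helper: accept str, bytes, or os.PathLike and return a str path."""
    # os.fspath returns str/bytes unchanged and calls __fspath__ on path-like
    # objects; it raises TypeError for anything else.
    path = os.fspath(pdf_path)
    if isinstance(path, bytes):
        path = os.fsdecode(path)
    return path


# Both call styles yield the same string path.
print(normalize_pdf_path("doc.pdf"))
print(normalize_pdf_path(Path("doc.pdf")))
```

Normalizing once at the entry point keeps the rest of the loading code free of type checks.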

ONNX model, option to flatten form fields

08 Oct 02:36
c4f0d34
  • Faster inference with ONNX
  • Remove warning when loading scikit-learn model
  • Flatten form fields into pdf

Fix bbox bug

27 May 22:59

Fixed a bug where bboxes were not unnormalized properly.
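
For readers unfamiliar with the term, unnormalizing a bbox means converting it from relative coordinates back to page units. The sketch below assumes bboxes are stored as `[x0, y0, x1, y1]` fractions of the page width and height. That layout and the function name `unnormalize_bbox` are assumptions for illustration, not pdftext's actual internals.

```python
def unnormalize_bbox(bbox, page_width, page_height):
    """Scale a [x0, y0, x1, y1] bbox from 0..1 fractions to page units.

    Assumed convention: x values are fractions of the page width,
    y values are fractions of the page height.
    """
    x0, y0, x1, y1 = bbox
    return [
        x0 * page_width,
        y0 * page_height,
        x1 * page_width,
        y1 * page_height,
    ]


# A box covering the middle half of the page width and the lower half of its height.
print(unnormalize_bbox([0.25, 0.5, 0.75, 1.0], 600, 800))
```

A common source of bugs in this kind of conversion is scaling x and y by the same dimension, which only shows up on non-square pages.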

Minor performance optimizations

24 May 18:00
  • Optimize dictionary access and loops for a ~10% speedup

Add optional parallel workers

23 May 19:30

Enables optional parallel workers when extracting text. This can cause a performance hit on small PDFs, but can speed things up 2x or more on larger ones. Set the number of workers with the --workers flag in the CLI, or with the workers kwarg.
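
The trade-off described above (overhead on small inputs, gains on large ones) follows from the usual pattern of splitting pages into chunks and handing each chunk to a worker. The following is a rough sketch of that pattern and not pdftext's implementation: `split_pages`, `extract_pages`, and `extract_text` are hypothetical names, the per-page extraction is a placeholder, and a thread pool stands in for whatever worker type the library actually uses (CPU-bound extraction would generally need processes to see a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor


def split_pages(num_pages, workers):
    """Split page indices 0..num_pages-1 into at most `workers` contiguous chunks."""
    workers = max(1, min(workers, num_pages))
    size, extra = divmod(num_pages, workers)
    chunks, start = [], 0
    for i in range(workers):
        # The first `extra` chunks take one additional page each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(range(start, end)))
        start = end
    return chunks


def extract_pages(page_indices):
    """Placeholder for real per-page text extraction."""
    return [f"text of page {i}" for i in page_indices]


def extract_text(num_pages, workers=None):
    """Extract all pages, optionally fanning chunks out to a worker pool."""
    if not workers or workers <= 1 or num_pages == 0:
        # Serial path: no pool start-up cost, which suits small documents.
        return extract_pages(list(range(num_pages)))
    chunks = split_pages(num_pages, workers)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # map() yields results in chunk order, so page order is preserved.
        results = list(pool.map(extract_pages, chunks))
    return [text for chunk in results for text in chunk]


print(split_pages(5, 2))
print(extract_text(5, workers=2) == extract_text(5))
```

Starting a pool and distributing chunks has a fixed cost, which is why a short document can end up slower with workers enabled than without.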

Fix font issue

07 May 20:14

Previously, not all spans had the right font information. This release fixes the issue.

Work around pdfium bug

06 May 17:26
  • Charbox has zero width/height when loose=True with rotation

Fix font names

02 May 23:57
  • Fix logic for pulling font names
  • Increase sample frequency