Releases: VikParuchuri/pdftext
Fix links to be in same span
What's Changed
- Misc bugfixes and improvements by @iammosespaulr in #32
- Bump version by @VikParuchuri in #33
- Dev by @VikParuchuri in #34
Full Changelog: v0.5.0...v0.5.1
Table and link extraction support
Summary
- Add table extraction support
- Add link support for references and external links
- Bugfixes
What's Changed
- fix: bbox sorting error by @simjak in #27
- Add table extraction by @VikParuchuri in #25
- Add support for PDF links and references by @iammosespaulr in #28
- Improved References by @iammosespaulr in #30
- Link support by @VikParuchuri in #29
Full Changelog: v0.4.1...v0.5.0
Pin pypdfium2
pypdfium2 4.30.1 has a bug that affects text extraction, so this release pins pypdfium2 to the previous version.
Improved Segmentation with Heuristic-Based Approach
We’ve removed pdftext's reliance on the decision tree for segmenting spans, lines, and blocks. Segmentation now uses simpler heuristics, which are more efficient and more accurate.
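To give a rough idea of what heuristic segmentation can look like, the sketch below groups characters into lines by checking how much their bounding boxes overlap vertically. The `Char` type, the `group_into_lines` function, and the 0.5 overlap threshold are all assumptions made for illustration; this is not pdftext's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """Hypothetical character record: text plus a top-left-origin bbox."""
    text: str
    x0: float
    y0: float  # top edge
    x1: float
    y1: float  # bottom edge


def vertical_overlap(a: Char, b: Char) -> float:
    """Fraction of the shorter box's height that both boxes share."""
    shared = min(a.y1, b.y1) - max(a.y0, b.y0)
    shortest = min(a.y1 - a.y0, b.y1 - b.y0)
    if shortest <= 0:
        return 0.0
    return max(shared, 0.0) / shortest


def group_into_lines(chars: List[Char], min_overlap: float = 0.5) -> List[str]:
    """Start a new line whenever a char barely overlaps the previous one.

    Assumes `chars` is already in reading order.
    """
    lines: List[List[Char]] = []
    for ch in chars:
        if lines and vertical_overlap(lines[-1][-1], ch) >= min_overlap:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return ["".join(c.text for c in line) for line in lines]
```

A rule like this needs no trained model and runs in a single pass over the characters, which is the kind of trade-off a heuristic approach makes.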
Fix loose charbox for quotes
Special chars don't work well with the loose charbox. We'll remove loose entirely soon, but this is an intermediate fix for an annoying issue with misplaced quotes.
Fix memory leak warnings
Close PDF documents properly to avoid warnings and memory leaks.
Fix PDF flattening
Ensure PDFs are flattened when multiprocessing is used.
Better device coordinate extraction
There were some cases where visual and text coordinates didn't align. This fixes that issue.
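For context, a common source of this kind of mismatch is that PDF page space has its origin at the bottom-left with y increasing upward, while device (image) space has its origin at the top-left with y increasing downward. The sketch below shows the basic flip for an unrotated page. The function name `pdf_to_device_bbox` and its parameters are hypothetical, and it deliberately ignores page rotation and crop box offsets; it is not the conversion pdftext performs.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def pdf_to_device_bbox(bbox: BBox, page_height: float, scale: float = 1.0) -> BBox:
    """Convert (x0, y0, x1, y1) from PDF page space to device space.

    PDF space: origin bottom-left, y grows upward, so y0 is the bottom edge.
    Device space: origin top-left, y grows downward.
    Assumes an unrotated page whose origin is at (0, 0).
    """
    x0, y0, x1, y1 = bbox
    return (
        x0 * scale,
        (page_height - y1) * scale,  # PDF top edge becomes device top
        x1 * scale,
        (page_height - y0) * scale,  # PDF bottom edge becomes device bottom
    )
```

For example, on a 792-point-tall page, a box whose PDF top edge sits at y = 720 ends up 72 points below the top of the device image.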
Revert extraction changes
Merge pull request #14 from VikParuchuri/dev: Revert extraction
Python 3.13 compatibility
Merge pull request #13 from VikParuchuri/dev: Python 3.13 support