ENH: Improve PDFium text extraction #11

Merged
merged 5 commits into from
Oct 31, 2023

Conversation

mqq-marek
Contributor

@mqq-marek mqq-marek commented Oct 29, 2023

Several additional changes:

  • ENH: Add PDFium image extraction
  • ROB: Make opening/parsing the cache file more robust
  • MAINT: Update deprecated pydantic API
  • MAINT: Add pdfrw to main.in

@MartinThoma MartinThoma changed the title from "Pdfium test updates" to "ENH: Improve PDFium text extraction" Oct 31, 2023
@MartinThoma MartinThoma merged commit 24c51dd into py-pdf:main Oct 31, 2023
@MartinThoma
Member

Good work! Thank you for updating the PR 🤗

@MartinThoma MartinThoma mentioned this pull request Oct 31, 2023
@mara004
Contributor

mara004 commented Nov 18, 2023

Just came across this, thanks for the addition!
I'd expect the benchmark results to be great with JPEG or JP2, but poor with any other format due to limitations in pdfium's public API, and especially poor with CCITT or JBIG2. Also note that this code doesn't take alpha masks into account (plus some more finicky things).

Adding pikepdf would also be nice, see #4. Programmatically, it's by far the best PDF image extractor I'm aware of.
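
To illustrate the format point above, here is a minimal sketch (not code from this PR, and not pdfium's or pikepdf's API) of how an extractor might decide what to do with an image stream based on its `/Filter` names. The filter names are the standard PDF ones; the helper `extraction_strategy` and its "passthrough"/"reencode" labels are hypothetical, and the sketch ignores alpha masks and colour space handling entirely.

```python
# Hypothetical helper for illustration only.
# Assumption: streams using these filters typically hold a complete
# JPEG / JPEG 2000 file that can be written out unchanged.
PASSTHROUGH_FILTERS = {
    "DCTDecode": "jpg",  # JPEG
    "JPXDecode": "jp2",  # JPEG 2000
}


def extraction_strategy(filters):
    """Return (strategy, file_extension) for a list of PDF filter names."""
    if len(filters) == 1 and filters[0] in PASSTHROUGH_FILTERS:
        return ("passthrough", PASSTHROUGH_FILTERS[filters[0]])
    # Anything else (FlateDecode, CCITTFaxDecode, JBIG2Decode, filter
    # chains, ...) has to be decoded to a bitmap and re-encoded; PNG is
    # an arbitrary choice of output format here.
    return ("reencode", "png")


if __name__ == "__main__":
    for f in (["DCTDecode"], ["JPXDecode"], ["CCITTFaxDecode"], ["JBIG2Decode"]):
        print(f, "->", extraction_strategy(f))
```

The "reencode" branch is where quality and speed would be expected to suffer, since the extractor has to go through a rendered bitmap rather than the original encoded data.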
