Use pypdfium2's new range-based text extractor #5
get_text() was boundary-based, which is not well suited to the use case of simply extracting all text of a page. I believe the new get_text_range() function might both yield better results and be more performant. This can be merged once pypdfium2 3.3 is released.
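For illustration, a minimal sketch of what whole-page extraction with the range-based function could look like. Only `get_text_range()` is named in this thread; `PdfDocument`, `get_page()`, `get_textpage()` and the `close()` calls are assumed from pypdfium2's helper API (3.3 or later), and `extract_all_text` and `pdf_path` are hypothetical names chosen for this example.

```python
def extract_all_text(pdf_path):
    """Return one string per page, using the range-based extractor."""
    # Imported lazily so this sketch can be loaded without pypdfium2 installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    try:
        for index in range(len(pdf)):
            page = pdf.get_page(index)
            textpage = page.get_textpage()
            # Assumption: called without arguments, get_text_range()
            # covers every character on the page.
            texts.append(textpage.get_text_range())
            # Close child objects before their parents.
            textpage.close()
            page.close()
    finally:
        pdf.close()
    return texts
```

Compared with the boundary-based call, no page rectangle has to be passed in, which is the point of the change for the "all text of a page" case.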
https://pypi.org/project/pypdfium2/#history - I'm super curious to see the new results :-)
FYI, v3.3 was released today
Overall, it's faster. However, Pdfium became a lot slower for https://arxiv.org/abs/2201.00214
Thank you for the update @mara004 🙏 As an unrelated side-note: Is Pdfium also able to extract images? We now also have this in the benchmark:
Thanks for the new results. I didn't expect it to be really different, given that it's still the same base. The slowdown on the first document appears suspicious to me, however. If I run
for the first run, and something like
for all following runs. Is it possible that there's some external explanation for the slowdown and that (py)pdfium is not at fault here?
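To separate a slow first run from the following ones, the timings could be collected along these lines. This is only a sketch using the standard library; `time_runs` and `dummy_extract` are hypothetical names, and `dummy_extract` merely stands in for the real extraction call on the document in question.

```python
import time
from statistics import mean


def time_runs(func, repeats=5):
    """Return (first_run_seconds, mean_seconds_of_following_runs)."""
    if repeats < 2:
        raise ValueError("need at least two runs to compare first and following timings")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    # The first run may include one-off costs (disk cache, library loading).
    return durations[0], mean(durations[1:])


def dummy_extract():
    # Hypothetical stand-in for the actual text extraction call.
    return sum(range(100_000))


if __name__ == "__main__":
    first, following = time_runs(dummy_extract)
    print(f"first run: {first:.4f}s, following runs (mean): {following:.4f}s")
```

If the first run is much slower than the rest, the difference is more likely caused by something external (for example file caching) than by the extraction code itself.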
In principle, yes. There are some functions in the raw API for this ( The problem is: if we use one of the There are also functions to get the raw data, filters and metadata, but this approach would be quite complicated, and I'm not even sure if the information provided by PDFium is sufficient (e.g. alpha masks, ICC profiles?). I've discussed this topic with a user already (there's an open issue about it).
For reference, even with the previous extraction strategy, there was already a slowdown from 0.7s to 1.7s with update 1ce8729. Given the time results I shared above, 0.7s would be more plausible.
@MartinThoma pypdfium2 now has a function for this in the devel branch (