Use pypdfium2's new range-based text extractor #5

Merged on Oct 11, 2022 (1 commit)

mara004
Contributor

@mara004 mara004 commented Oct 7, 2022

get_text() was boundary-based, which is not well suited to the use case of simply extracting all text of a page. I believe the new get_text_range() function might both yield better results and perform better.

This can be merged once pypdfium2 3.3 is released.
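
For readers unfamiliar with the API, here is a minimal sketch of what range-based extraction of a whole document could look like. It assumes pypdfium2 3.3 or later, where `PdfTextPage.get_text_range()` called without arguments covers every character of the page. The helper names `extract_page_texts` and `extract_file` are made up for illustration and are not part of pypdfium2 or the benchmark.

```python
def extract_page_texts(pdf):
    """Return the text of every page using the range-based extractor.

    `pdf` is expected to behave like an open pypdfium2.PdfDocument:
    len(pdf) is the page count, pdf.get_page(i) returns a page,
    and page.get_textpage() returns a text page.
    """
    texts = []
    for index in range(len(pdf)):
        textpage = pdf.get_page(index).get_textpage()
        # No index/count arguments: extract the full character range of the page.
        texts.append(textpage.get_text_range())
    return texts


def extract_file(path):
    """Hypothetical convenience wrapper: open a file and join all page texts."""
    # Imported lazily so extract_page_texts() stays usable without pypdfium2.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(path)
    return "\n".join(extract_page_texts(pdf))
```

The point of the range-based call is that no page boundaries have to be supplied, so the caller does not depend on how the page box is defined.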

@MartinThoma
Member

https://pypi.org/project/pypdfium2/#history - I'm super curious to see the new results :-)

@mara004
Contributor Author

mara004 commented Oct 10, 2022

FYI, v3.3 was released today

@MartinThoma
Member

[Image: benchmark timing results]

Overall, it's faster. However, Pdfium became a lot slower for https://arxiv.org/abs/2201.00214

@MartinThoma
Member

The quality didn't really change:

[Image: benchmark text-extraction quality results]

@MartinThoma MartinThoma merged commit 2417507 into py-pdf:main Oct 11, 2022
@MartinThoma
Member

Thank you for the update @mara004 🙏

As an unrelated side-note: Is Pdfium also able to extract images? We now also have this in the benchmark:
https://github.com/py-pdf/benchmarks/blob/main/benchmark.py#L170-L195

@mara004
Contributor Author

mara004 commented Oct 11, 2022

Thanks for the new results. I didn't expect it to be really different, given that it's still the same base. get_text_range() is just a bit nicer internally.

The slowdown on the first document appears suspicious to me, however. If I run time pypdfium2 extract-text "2201.00214.pdf" --strategy range multiple times, I get

real    0m2.919s
user    0m0.730s
sys     0m0.206s

for the first run, and something like

real    0m0.676s
user    0m0.750s
sys     0m0.162s

for all following runs.

Is it possible that there's some external explanation for the slowdown and that (py)pdfium is not at fault here?
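
One way to separate a first-run effect from a real regression is to time several consecutive runs and look at the first one apart from the rest. The following is only a sketch of that idea; `time_runs` and `summarize` are illustrative names, not part of the benchmark or of pypdfium2. It measures calls made inside one Python process, so unlike the `time` output above it does not include interpreter startup.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Report the first (possibly cold) run separately from the median of the rest."""
    first, rest = durations[0], durations[1:]
    warm = statistics.median(rest) if rest else None
    return {"first": first, "warm_median": warm}
```

If the benchmark only records a single run per document, a cold first run like the one above would be counted in full, which might explain part of the difference.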

@mara004
Contributor Author

mara004 commented Oct 11, 2022

As an unrelated side-note: Is Pdfium also able to extract images? We now also have this in the benchmark:
https://github.com/py-pdf/benchmarks/blob/main/benchmark.py#L170-L195

In principle, yes. There are some functions in the raw API for this (FPDFImageObj_GetBitmap(), FPDFImageObj_GetRenderedBitmap(), FPDFImageObj_GetImageDataDecoded(), FPDFImageObj_GetImageDataRaw(), ...).

The problem is: if we use one of the FPDFImageObj_*Bitmap() functions, the result is only suited for displaying, not saving. Otherwise we would have to re-encode the pixel data, which would be disadvantageous in terms of performance / compression / quality. So I'm not sure how to best create a support model for this.

There are also functions to get the raw data, filters and metadata, but this approach would be quite complicated, and I'm not even sure whether the information provided by PDFium is sufficient (e.g. alpha masks, ICC profiles?).
(As I stated in #4, pikepdf is able to smartly handle the raw data and "reconstruct" the original image in many cases.)

I've discussed this topic with a user already (there's an open issue about it).
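
To illustrate the raw-data approach, here is a rough sketch of the kind of decision a "smart" extractor has to make from the filter list alone. It relies on the PDF convention that filters are listed in decoding order, so an image-format filter such as DCTDecode (JPEG) or JPXDecode (JPEG 2000) comes last, and the data left after undoing the preceding filters is already a complete image file. The function name and return values are hypothetical. This is not PDFium's or pikepdf's actual logic, and it deliberately ignores color spaces, masks and ICC profiles, which is exactly the information that may be missing.

```python
# Filters whose encoded stream data is itself a complete image file format.
IMAGE_FILE_FILTERS = {"DCTDecode": "jpg", "JPXDecode": "jp2"}


def plan_extraction(filters):
    """Decide how an image stream could be saved, given its filter names in decoding order.

    Returns ("save_encoded", extension) when the still-encoded data can be written
    out as is, or ("reencode", "png") when the pixel data would have to be re-encoded.
    """
    if filters and filters[-1] in IMAGE_FILE_FILTERS:
        return ("save_encoded", IMAGE_FILE_FILTERS[filters[-1]])
    # Everything else (e.g. FlateDecode only) yields bare pixel data.
    return ("reencode", "png")
```

Only the "save_encoded" branch avoids the performance, compression and quality costs of re-encoding mentioned above.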

@mara004 mara004 deleted the patch-1 branch October 11, 2022 10:45
@mara004
Contributor Author

mara004 commented Oct 11, 2022

For reference, even with the previous extraction strategy, there was already a slowdown from 0.7s to 1.7s with update 1ce8729. Given the time results I shared above, 0.7s would be more plausible.
Is it possible that this is related to Python bytecode compilation on first run or something?

@mara004
Contributor Author

mara004 commented Nov 17, 2022

As an unrelated side-note: Is Pdfium also able to extract images? We now also have this in the benchmark:

@MartinThoma pypdfium2 now has a function for this in the devel branch (PdfImage.extract()), but it's not as good as it theoretically could be, since PDFium does not expose all required information, as I hinted above.
(I wrote an essay about that in PDFium's bug tracker: https://crbug.com/pdfium/1930)
