Text is Not Rendered in Certain PDFs #348

Open
TechD123 opened this issue May 30, 2024 · 18 comments
Comments

@TechD123

Since updating from 3.19 to the recently published 3.26 (via F-Droid), certain PDFs don't render any text. As my main use case for this app is its PDF reading feature, I have reverted to the previous version.

I'll share a link to a sample PDF once I find one that's affected. So far I've only seen this on PDFs that would leak PII if I shared them here... :P

@andiwand
Member

Thank you for reporting this, @TechD123! If you don't want to share the files publicly, you can send them to us by email, if that is an option. Otherwise we can wait for another file that shows this issue.

@TomTasche
Member

I've seen the same behavior, but can't share those PDFs publicly either. @andiwand @ViliusSutkus89 let me know if you are interested in taking a look.

@andiwand
Member

andiwand commented Jun 2, 2024

@TomTasche Can you put it into the private testing repo? Otherwise you can email it to me.

@ViliusSutkus89
Contributor

If I could reproduce it, I could tell if the problem is in upstream pdf2htmlEX or if it's on our side

@ViliusSutkus89
Contributor

@TomTasche I tried those PDFs on the pdf2htmlEX Docker image available on Docker Hub and it gives the same result, although that official pdf2htmlEX release image is 4 years old and uses outdated Poppler and FontForge.

I've inspected the HTML DOM: all the text is there, but hidden by CSS. I assume it's a font issue; will try to debug.

ViliusSutkus89 self-assigned this Jun 21, 2024
@TomTasche
Member

Since PDF is quite important for our users, I'd actually propose to set a bounty for this issue. @andiwand @ViliusSutkus89

I'd love to have a fix for this soon.

@ViliusSutkus89
Contributor

pdf2htmlEX-Android conan is nearly there. Then I can focus on debugging upstream pdf2htmlEX

@ViliusSutkus89
Contributor

Did some debugging. This error usually happens when FontForge errors out while trying to save a "malformed" font. I don't know how much of that malformation is an actually bad font and how much is bad parsing. Will see what I can do to work around it.

TomTasche added a commit that referenced this issue Jul 15, 2024
…ion-bump

Upgrade pdf2htmlEX-Android to 0.18.25 to workaround issue #348
@TomTasche
Member

A fix for that is rolling out now, thanks a lot @ViliusSutkus89 ! 🎉

@TechD123
Author

After the aforementioned 3.26, the next/latest version on F-Droid is now 3.31 which I just gave a spin (was on 3.19 the entire time before).

Unfortunately, a similar issue has appeared, where certain PDFs will either open on a blank screen or render only seemingly random words or letters. I have attached a few for convenience, but the error rate seems far higher than 3.26 based on my gut feeling.
verylong.pdf
long.pdf
short.pdf

I hope you'll give it another look :)

@ViliusSutkus89
Contributor

True. Can reproduce on all three files

@ViliusSutkus89
Contributor

OK, I've debugged the issue. When opening a document, we first try to load it through OdfLoader; only if that fails do we try pdf2htmlEX-Android or the other loaders. For most PDF documents, OdfLoader fails. For these documents it doesn't, so pdf2htmlEX is never called. The output OdfLoader produces for them is totally broken, but OdfLoader thinks it's OK.
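
The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: `LoaderError`, `odf_loader`, `pdf_loader`, `load`, and the `b"tricky"` marker are all hypothetical stand-ins, and the real loaders are far more involved.

```python
from typing import Callable, List, Tuple

Loader = Callable[[bytes], str]


class LoaderError(Exception):
    """Raised by a loader that cannot handle the given file (hypothetical)."""


def odf_loader(data: bytes) -> str:
    # Stand-in for OdfLoader. It accepts ZIP-based files, and (contrived for
    # this sketch) also "succeeds" on PDFs containing the marker b"tricky",
    # returning output that would be unusable in practice.
    if data.startswith(b"PK") or b"tricky" in data:
        return "odf-html"
    raise LoaderError("not an ODF document")


def pdf_loader(data: bytes) -> str:
    # Stand-in for the pdf2htmlEX-based loader.
    if data.startswith(b"%PDF-"):
        return "pdf-html"
    raise LoaderError("not a PDF")


def load(data: bytes, loaders: List[Tuple[str, Loader]]) -> Tuple[str, str]:
    # Try each loader in order; the first one that does not raise wins.
    # A loader that returns broken output without raising blocks the fallback.
    for name, loader in loaders:
        try:
            return name, loader(data)
        except LoaderError:
            continue
    raise LoaderError("no loader accepted the file")


CHAIN: List[Tuple[str, Loader]] = [("odf", odf_loader), ("pdf", pdf_loader)]
```

With this chain, an ordinary PDF falls through to the `pdf` loader, while a PDF that the first loader happens to accept never reaches it, which mirrors the symptom reported for the attached files.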

I've also noticed that we haven't updated pdf2htmlEX and wvWare versions here, but it's not the cause of the issue.

This issue will automagically go away once we solve #369, which is blocked by opendocument-app/OpenDocument.core#387. Will try to get on it this week.

@andiwand
Member

@TomTasche should we just work around this by detecting on the platform side whether the file is a PDF?

TomTasche added a commit that referenced this issue Oct 30, 2024
@TomTasche
Member

> OK, I've debugged the issue. When opening a document, we first try to load it through OdfLoader; only if that fails do we try pdf2htmlEX-Android or the other loaders. For most PDF documents, OdfLoader fails. For these documents it doesn't, so pdf2htmlEX is never called. The output OdfLoader produces for them is totally broken, but OdfLoader thinks it's OK.

Thanks for the investigation! I found a (dirty) workaround for now in #377 and am rolling it out as we speak.

@andiwand can you check why the core does not fail for these PDFs? I can't remember exactly, but I don't think that's "working as intended".

@TomTasche
Member

(Note that my change does not fix the original issue reported here. I think that's an actual bug in the PDF parser? @ViliusSutkus89)

@ViliusSutkus89
Contributor

@TomTasche, which one looks weird? I just tried the v3.32 app-pro-release.apk from GitHub releases, and all 4 files (verylong.pdf, long.pdf, short.pdf and the emailed yourdocument.pdf) look decent to me.

@TomTasche
Member

Nice! I haven't tested the documents in this thread, because I assumed they were caused by a different issue.

@TechD123 can you confirm this is resolved with the latest version?

@andiwand
Member

> @andiwand can you check why the core does not fail for these PDFs? I can't remember exactly, but I don't think that's "working as intended".

I think the current version of the core will try to open and translate all PDFs. We have no way of telling it not to try. I think the only way is to catch this upfront.
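
One way to "catch this upfront" could be a magic-byte check before any loader is chosen. The sketch below is only an assumption about how that might look, written in Python for brevity: `looks_like_pdf`, `choose_loader`, the returned labels, and the 1024-byte search window are all hypothetical and not taken from the app or the core.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, window: int = 1024) -> bool:
    # A PDF normally starts with the "%PDF-" header. Some files carry a few
    # junk bytes before it, so this sketch searches a small leading window
    # instead of requiring the header at offset 0. The window size is an
    # assumption chosen for illustration.
    return PDF_MAGIC in data[:window]


def choose_loader(data: bytes) -> str:
    # Route PDFs straight to the PDF converter and everything else to the
    # core, so the core never gets a chance to "succeed" on a PDF.
    return "pdf2htmlEX" if looks_like_pdf(data) else "core"
```

The design choice here is to decide by file content rather than by file extension or MIME type, since either of those can be missing or wrong when a document is opened from another app.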


4 participants