[13.4.2] Lossy compression of PNGs into JPEGs when it shouldn't #940
Closing due to old version.
Sorry for the old version. I pulled a newer one (though still not the latest... might get to doing that later in the week/weekend) and both bugs were still there:
I'm not sure when I'll get the chance to test 13.4, so I'll leave it at this for now. Thanks, and sorry for your time. P.S. Scripts without the stdout/stderr noise: 1.sh
2.sh
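For context, here is a hypothetical sketch of the shape such a repro script takes — placeholder commands and file names, not the actual contents of 1.sh:

```bash
#!/bin/sh
# Hypothetical repro sketch (placeholder names, not the real 1.sh):
# embed a PNG losslessly, OCR at optimize level 1, compare sizes.
img2pdf page.png -o input.pdf               # wrap the PNG in a PDF
ocrmypdf --optimize 1 input.pdf output.pdf  # OCR + level-1 optimization
stat -c %s input.pdf output.pdf             # GNU stat: file sizes in bytes
```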
@jbarlow83 reproduced on 13.4.1:
FYI, I took a look at 2.sh's Example-compress.pdf in a text editor and noticed an extra stream (or two?), so I ran `pdftk Example-compress.pdf cat output Example-compress-pdftk.pdf` over it, and that helped quite a bit.

Though I'm not sure if it's a bug (loose unreferenced objects?) or a feature (thumbnail? color profile? duplicate split stream for progressive web rendering? integrity redundancies? binary metadata?...), it seems to be the cause of the size increase.
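If the culprit really is loose or duplicated objects, a garbage-collecting rewrite should reproduce what pdftk did here. A sketch with other tools that do the same kind of rewrite (file names taken from the comment above):

```bash
# mutool clean: -g drops unreferenced objects; -ggg additionally merges
# duplicate objects. qpdf rewrites the file structure on a plain copy.
mutool clean -ggg Example-compress.pdf Example-compress-mutool.pdf
qpdf Example-compress.pdf Example-compress-qpdf.pdf
stat -c %s Example-compress*.pdf   # compare the resulting sizes
```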
@jbarlow83 reproduced on the current 13.4.2:
1.sh:
2.sh:
If you use `--output-type pdf`:
1.sh:
2.sh:
I can confirm `--output-type pdf` skips the lossy JPEG compression: the JPEG artifacts are gone, and they reappear if I use `-O0` or `-O1` without `--output-type pdf`.
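To make the comparison concrete, these are the two invocations being contrasted (input/output names assumed):

```bash
# Default output type (pdfa): the Ghostscript PDF/A conversion may
# transcode images, so JPEG artifacts can appear even at -O0/-O1.
ocrmypdf -O1 input.pdf with-artifacts.pdf

# --output-type pdf skips the Ghostscript PDF/A step, so the PNG
# keeps its lossless (Flate) encoding.
ocrmypdf -O1 --output-type pdf input.pdf without-artifacts.pdf
```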
I have the open-source PDF24 on my computer, and it is able to Flate-compress a TIFF to PDF/A with no JPEG compression artifacts, also apparently using Ghostscript. The result is validated as PDF/A by https://avepdf.com/pdfa-validation.
@rmast Anything using Ghostscript will run into lossy conversions in some cases, since Ghostscript doesn't support the same image formats and color profiles as PDF. The specific issue here is a nuance of the documented limitations ( https://ocrmypdf.readthedocs.io/en/latest/introduction.html#limitations ). I've raised a similar issue with pdfScale.sh, where I've also made some test scripts to illustrate the problem: tavinus/pdfScale#27

I should have closed this issue since it's known and documented. Anyhow, PDF24 is closed-source freeware, so I won't look too much into it, but if it uses Ghostscript, it will have to deal with similar issues. Otherwise, FBCNN seems like a nice image-restoration neural net model (I've personally used waifu2x for sheet-music scaling before OCRing with Audiveris, with good results), but it's still a lossy process, so it's only appropriate as a mid-stage before running Tesseract.
You're right, I can't find the source online. I'll try `--pdfa-image-compression lossless` to see whether it preserves my TIFF. I assumed that was already covered by `-O0`.
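A sketch of that experiment (file names assumed; ocrmypdf accepts an image directly as input):

```bash
# Keep PDF/A output, but tell the Ghostscript PDF/A conversion to use
# only lossless image compression.
ocrmypdf --pdfa-image-compression lossless scan.tiff scan.pdf
```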
Unfortunately, Ghostscript does not announce when it transcodes.
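One way to catch a silent transcode is to diff the image table before and after. A sketch using poppler's pdfimages (file names assumed):

```bash
# The "enc" column shows the codec per image: "image" means Flate/raw
# (lossless), "jpeg" means DCTDecode. A png-to-jpeg transcode shows up
# as a change in that column.
pdfimages -list input.pdf  > before.txt
pdfimages -list output.pdf > after.txt
diff before.txt after.txt
```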
It wouldn't help, since there are still a lot of remaining feature mismatches between PostScript, PDF, and PDF/A. By the way, what would it take to have OCRmyPDF preserve existing PDF/A documents? I don't actually need it myself, but if PDF/A is a big enough deal to default to despite the lossy transcoding... Well, it's not like OCRmyPDF is adding any multimedia features, so maybe it comes down to just setting a metadata flag or something... Right?
I would expect all the necessary tricks to be in https://github.com/oxplot/pdfrankenstein
Disturbingly enough, going through SVG instead of PostScript might actually be better, since SVG doesn't specify a limit on supported raster formats or color profiles and should be able to accommodate a transparent text layer... Frankenstein indeed.

Putting that aside, I meant to ask what it would take to modify OCRmyPDF itself so it preserves a PDF/A as PDF/A that still passes veraPDF. According to their docs, it's something pikepdf/qpdf are able to do, and, without looking into the specs, I'm assuming adding a text layer to a PDF shouldn't break PDF/A. So, I'm guessing OCRmyPDF would only need to avoid things like linearization to pass veraPDF?

P.S. @jbarlow83 Feel free to close the issue if you feel it's appropriate.
If an input document is already a valid PDF/A, and we're only adding the text layer, and we're not preprocessing images, we could probably keep it a PDF/A without passing through Ghostscript. It's a special case, but it seems like a worthwhile one.

Linearization is allowed in PDF/A if the PDF is 1.5 or above, IIRC.
Actually, it turns out OCRmyPDF will already preserve PDF/A if the input is valid PDF/A and (counterintuitively) `--output-type pdf` is used.
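A sketch of that path, assuming a valid PDF/A input (file names assumed):

```bash
# Counterintuitively, --output-type pdf bypasses the Ghostscript
# rewrite, so a valid PDF/A input passes through with only the text
# layer added and should still validate as PDF/A.
ocrmypdf --output-type pdf archival.pdf archival-ocr.pdf
```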
I haven't had much luck trying to install veraPDF, so I'll just take your word for it and say congrats :)

I think a note in the docs would be more than enough. Besides, unless you want to bring in veraPDF as a dependency, trusting the PDF/A metadata tag is probably not a good idea.
I didn’t install veraPDF. I just uploaded the PDF for an online check.
I don’t know whether that check exists in open source.
There is a validator in PDFBox:
https://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html
@rmast veraPDF is FOSS and optionally uses PDFBox: https://docs.verapdf.org/develop/#license

You can also just run the .jar from their installer, like the Arch AUR package does: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=verapdf
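For completeness, once the launcher is on PATH, validation is a one-liner (the flavour value here is an assumption; veraPDF can also auto-detect it):

```bash
# Validate against the PDF/A-2b profile; veraPDF prints a pass/fail
# report (XML by default).
verapdf --flavour 2b archival-ocr.pdf
```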
I believe it should be running the image through pngquant instead at optimize level 1. Though this might also be the PDF format changing to the archival spec...

P.S. Forgot to mention: the PNG-to-JPEG bug also happens with some compressed PNGs, but I haven't bothered trying to replicate that, since I believe it should never try to convert bitmap images to JPEGs to begin with.
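For reference, this is the kind of pngquant step meant here — a hedged sketch, not OCRmyPDF's actual pipeline (file names and quality range are assumptions):

```bash
# Palette-quantize the PNG: the result is still a PNG, so there are no
# JPEG artifacts; keep whichever file is smaller.
pngquant --quality=65-80 --output page-q.png page.png
stat -c %s page.png page-q.png
```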