ImageWriter save FLATE_DECODE image as BMP makes the output file corrupt #916

Closed
dotrunghieu96 opened this issue Nov 8, 2023 · 5 comments

Comments

@dotrunghieu96

Bug report

Thanks for finding the bug! To help us fix it, please make sure that you
include the following information:

  • A description of the bug:
    ImageWriter prioritizes BMP over FLATE_DECODE, so some images in my PDF are saved directly as .bmp files, which corrupts them.

  • Steps to reproduce the bug. Try to minimize the number of steps needed.
    Include the command and/or script that you use. Also include the PDF that
    you use.:

    • Load the LTPage from the PDF page
    • In the LTPage, find the LTImage objects and save them with ImageWriter
  • If relevant, include the output and/or error stacktrace: Some FLATE_DECODE images are saved as .bmp files and cannot be opened

@pietermarsman
Member

Hi @dotrunghieu96,

Thanks for the bug report and the corresponding PR.

Could you share a PDF and some code that you use to reproduce this bug? That will allow me to understand the impact of your suggested change better.

@dotrunghieu96
Author

Hi @pietermarsman, this is the file that I used
GitGuide.pdf

In the code, I first open the PDF and build the document:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# open the PDF file (pdf_doc is the path to the PDF)
fp = open(pdf_doc, "rb")
# create a parser object associated with the file object
parser = PDFParser(fp)
# create a PDFDocument object that stores the document structure
doc = PDFDocument(parser)
# connect the parser and document objects
parser.set_document(doc)

Then parse the LTObjects

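from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTImage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager

# rsrcmgr and laparams were not shown in the original snippet; defaults are assumed here
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for i, page in enumerate(PDFPage.create_pages(doc)):
    print("parsing pages:", i + 1, flush=True)
    interpreter.process_page(page)
    # receive the LTPage object for this page
    layout = device.get_result()
    for lt_obj in layout:
        if isinstance(lt_obj, LTImage):
            # page_number and images_folder are defined elsewhere in the reporter's script
            saved_file = save_image(lt_obj, page_number, images_folder)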

In save_image, I used the ImageWriter class:

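from pdfminer.image import ImageWriter
from pdfminer.layout import LTImage

def save_image(lt_image: LTImage, page_number, images_folder):
    image_writer = ImageWriter(images_folder)
    file_name = image_writer.export_image(lt_image)
    # return the written file name so the caller can store it in saved_file
    return file_name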

The problem is that the images in the PDF use FLATE_DECODE, but ImageWriter saves them as .bmp files, which corrupts them.

So I moved the FLATE_DECODE check to a higher priority so that the _save_bytes() method is used first. The images are then saved as ".jpg" files, and they are perfectly viewable.
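
A file extension does not say what the bytes in the file actually are, so one way to double-check the claim above is to look at the leading magic bytes of each exported file. The following is a minimal sketch that does not depend on pdfminer; the function name `sniff_image_format` is made up for illustration, and the two-byte BMP signature is short enough that this is a heuristic rather than proof.

```python
def sniff_image_format(data: bytes) -> str:
    """Guess an image container format from its leading magic bytes."""
    # JPEG files start with the SOI marker followed by another marker byte
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    # BMP files start with the ASCII signature "BM"
    if data.startswith(b"BM"):
        return "bmp"
    # PNG files start with a fixed 8-byte signature
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return "unknown"
```

To use it on an exported file, read the first few bytes in binary mode, for example `sniff_image_format(open(path, "rb").read(16))`, and compare the result with the extension that ImageWriter chose.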

@pietermarsman
Member

pietermarsman commented Jan 16, 2024

I cannot replicate this with the latest version.

Using

python tools/pdf2txt.py ~/Downloads/GitGuide.pdf --output-dir images

I get all the images properly formatted. Some jpg's.

[image: X8]

And a bunch as bmp's (converted to jpg so that it can be shown by GitHub).

[image: X44]

@pietermarsman
Member

Let me know if the issue is still there for you, and we can reopen this issue. In that case, could you specify what you mean by "corrupt"?
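
One way to make "corrupt" concrete is to check an exported .bmp file against its own header. The sketch below assumes the common layout of a 14-byte file header followed by a BITMAPINFOHEADER-style block and only handles uncompressed images; `bmp_problems` is a hypothetical helper, not part of pdfminer.

```python
import struct
from typing import List


def bmp_problems(data: bytes) -> List[str]:
    """Return a list of inconsistencies found in an uncompressed BMP file."""
    if len(data) < 34:
        return ["file is shorter than a BMP header"]

    # 14-byte file header: signature, file size, two reserved fields, pixel offset
    magic, file_size, _r1, _r2, pixel_offset = struct.unpack_from("<2sIHHI", data, 0)
    # start of the info header: its size, width, height, planes, bits per pixel, compression
    _dib_size, width, height, _planes, bpp, compression = struct.unpack_from(
        "<IiiHHI", data, 14
    )

    problems = []
    if magic != b"BM":
        problems.append("missing 'BM' signature")
    if file_size != len(data):
        problems.append(f"header says {file_size} bytes, file has {len(data)}")
    if width <= 0 or height == 0:
        problems.append("width or height is not usable")
    elif compression == 0:
        # each row is padded to a multiple of 4 bytes
        row_size = (bpp * width + 31) // 32 * 4
        expected = row_size * abs(height)
        if len(data) - pixel_offset < expected:
            problems.append("pixel data is truncated")
    return problems
```

An empty list means the header and the amount of pixel data agree with each other; it does not prove that the pixels themselves are correct.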

@iraykhel

Yup, extracting .bmp doesn't work.
Crashes here:
    if params and "Predictor" in params:
TypeError: argument of type 'PDFObjRef' is not iterable

If this check is bypassed, extracted .bmp is corrupted.

from pdfminer.high_level import extract_pages
from pdfminer.image import ImageWriter
from pdfminer.layout import LTFigure, LTImage

pdf_path = 'path/to/bmp'
W = ImageWriter('path/to/storage')
pages = extract_pages(pdf_path)
# only the first page is inspected
for element in next(pages):
    # images are nested inside LTFigure containers
    if isinstance(element, LTFigure):
        for sub in element:
            if isinstance(sub, LTImage):
                W.export_image(sub)

bmpsample2.pdf
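
The TypeError quoted above is what Python raises when `in` is applied to an object that is neither a container nor iterable, which would happen if the decode parameters arrive as an unresolved indirect reference instead of a dict. The sketch below reproduces that failure mode with a stand-in class and shows a defensive check; `ObjRef`, `resolve_fully` and `has_predictor` are hypothetical names for illustration and are not pdfminer's code. In pdfminer.six itself, `resolve1` from `pdfminer.pdftypes` plays the role of the resolving step.

```python
from typing import Any


class ObjRef:
    """Hypothetical stand-in for an indirect reference such as PDFObjRef."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def resolve(self) -> Any:
        return self._target


def resolve_fully(value: Any) -> Any:
    """Follow references until a concrete value is reached."""
    while isinstance(value, ObjRef):
        value = value.resolve()
    return value


def has_predictor(params: Any) -> bool:
    """Check for a Predictor entry without assuming params is already a dict."""
    params = resolve_fully(params)
    return isinstance(params, dict) and "Predictor" in params
```

With this guard, `has_predictor(ObjRef({"Predictor": 15}))` resolves the reference before the membership test, while writing `"Predictor" in ObjRef({})` directly raises the same kind of TypeError as in the report.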
