ImageWriter save FLATE_DECODE image as BMP makes the output file corrupt #916

dotrunghieu96 · 2023-11-08T02:56:28Z

Bug report

Thanks for finding the bug! To help us fix it, please make sure that you
include the following information:

A description of the bug:
ImageWriter prioritizes BMP over FLATE_DECODE, so some images in my PDF are saved directly as .bmp files, which corrupts them.
Steps to reproduce the bug. Try to minimize the number of steps needed.
Include the command and/or script that you use. Also include the PDF that
you use.:
- Load the LTPage from a PDF page
- In the LTPage, find the LTImage objects and save them with ImageWriter
If relevant, include the output and/or error stacktrace: Some FLATE_DECODE images are saved as .bmp and cannot be opened

pietermarsman · 2023-12-22T20:55:33Z

Thanks for the bug report and the corresponding PR.

Could you share a PDF and some code that you use to reproduce this bug? That will allow me to understand the impact of your suggested change better.

dotrunghieu96 · 2023-12-25T03:55:25Z

Hi @pietermarsman, this is the file that I used
GitGuide.pdf

In code, I first open the document and set up the parser:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# open the PDF file (pdf_doc is the path to the PDF)
fp = open(pdf_doc, "rb")
# create a parser object associated with the file object
parser = PDFParser(fp)
# create a PDFDocument object that stores the document structure
doc = PDFDocument(parser)
# connect the parser and document objects
parser.set_document(doc)

Then parse the LTObjects

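from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTImage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager

# assumed defaults; the original snippet did not show how these were created
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for i, page in enumerate(PDFPage.create_pages(doc)):
    page_number = i + 1
    print("parsing pages:", page_number, flush=True)
    interpreter.process_page(page)
    # receive the LTPage object for this page
    layout = device.get_result()
    for lt_obj in layout:
        if isinstance(lt_obj, LTImage):
            # images_folder is the output directory for extracted images
            saved_file = save_image(lt_obj, page_number, images_folder)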

In save_image, I used the ImageWriter class:

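from pdfminer.image import ImageWriter
from pdfminer.layout import LTImage

def save_image(lt_image: LTImage, page_number, images_folder):
    image_writer = ImageWriter(images_folder)
    # export_image writes the file and returns its name
    file_name = image_writer.export_image(lt_image)
    return file_name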

The problem is that the images in the PDF use FLATE_DECODE, but ImageWriter saves them as .bmp files, which corrupts them.

So I moved FLATE_DECODE to a higher priority so that the _save_bytes() method is tried first. The images are then saved as ".jpg" and are perfectly viewable.
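
To illustrate why the order of checks matters, here is a minimal sketch of a first-match dispatch chain. It is not pdfminer's actual code: `FakeImage`, the two predicates, and the writer labels are hypothetical names chosen for illustration. It only shows that an image matching two conditions is handled by whichever one is tested first.

```python
from typing import Callable, List, Tuple


class FakeImage:
    """Hypothetical, simplified image description (not pdfminer's LTImage)."""

    def __init__(self, filters: List[str], bits: int, colorspace: str) -> None:
        self.filters = filters
        self.bits = bits
        self.colorspace = colorspace


Rule = Tuple[Callable[[FakeImage], bool], str]


def is_rgb8(img: FakeImage) -> bool:
    # 8 bits per component in an RGB colour space
    return img.bits == 8 and img.colorspace == "DeviceRGB"


def is_flate(img: FakeImage) -> bool:
    # exactly one filter, and it is FlateDecode
    return img.filters == ["FlateDecode"]


# Two orderings of the same rules; the labels stand in for writer methods.
BMP_FIRST: List[Rule] = [(is_rgb8, "bmp"), (is_flate, "bytes")]
FLATE_FIRST: List[Rule] = [(is_flate, "bytes"), (is_rgb8, "bmp")]


def choose_writer(img: FakeImage, rules: List[Rule]) -> str:
    """Return the label of the first rule whose predicate matches."""
    for predicate, writer in rules:
        if predicate(img):
            return writer
    return "raw"  # fallback when nothing matches


img = FakeImage(["FlateDecode"], 8, "DeviceRGB")
print(choose_writer(img, BMP_FIRST))    # bmp
print(choose_writer(img, FLATE_FIRST))  # bytes
```

An 8-bit RGB image compressed with FlateDecode satisfies both predicates, so reordering the rules changes which writer handles it.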

pietermarsman · 2024-01-16T20:43:00Z

I cannot replicate this with the latest version.

Using

python tools/pdf2txt.py ~/Downloads/GitGuide.pdf --output-dir images

All the images are extracted and properly formatted. Some are saved as JPEGs.

And a bunch are saved as BMPs (converted to JPEG here so that GitHub can display them).

pietermarsman · 2024-01-16T20:51:24Z

Let me know if the issue is still there for you, and we can reopen this issue. In that case, could you specify what you mean by "corrupt"?
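
One way to make "corrupt" concrete is to inspect the first bytes of each extracted file. The sketch below uses only the standard library; `describe_image_file` is a hypothetical helper, not part of pdfminer. It checks well-known file signatures and, for BMP, whether the size and pixel-offset fields in the 14-byte file header agree with the file on disk. Some writers leave these fields unset, so an inconsistent header is a hint rather than proof of corruption.

```python
import struct
from pathlib import Path


def describe_image_file(path: str) -> str:
    """Classify a file by its signature and sanity-check BMP headers."""
    data = Path(path).read_bytes()
    if data[:2] == b"BM" and len(data) >= 14:
        # BMP file header: bytes 2-5 hold the file size, bytes 10-13 the
        # offset of the pixel data, both little-endian unsigned 32-bit.
        (declared_size,) = struct.unpack_from("<I", data, 2)
        (pixel_offset,) = struct.unpack_from("<I", data, 10)
        consistent = declared_size == len(data) and pixel_offset <= len(data)
        return "bmp (header consistent)" if consistent else "bmp (header inconsistent)"
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    return "unknown"
```

Running this over the output directory shows whether a file's extension matches its content, for example a ".jpg" file that does not start with a JPEG signature.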

iraykhel · 2024-02-28T20:55:45Z

Yup, extracting .bmp doesn't work.
Crashes here:
if params and "Predictor" in params:
TypeError: argument of type 'PDFObjRef' is not iterable

If this check is bypassed, extracted .bmp is corrupted.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTImage
from pdfminer.image import ImageWriter

pdf_path = 'path/to/bmp'
W = ImageWriter('path/to/storage')
pages = extract_pages(pdf_path)
for element in next(pages):  # first page only
    if isinstance(element, LTFigure):
        for sub in element:
            if isinstance(sub, LTImage):
                W.export_image(sub)

bmpsample2.pdf
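
The TypeError above is what Python raises when `in` is applied to an object that is neither a container nor iterable, which suggests the decode parameters were still an indirect reference when the check ran. The sketch below reproduces that with a hypothetical stand-in class (`FakeObjRef`, `resolve_if_ref`, and `has_predictor` are illustrative names, not pdfminer's API) and shows the reference being resolved before the membership test. pdfminer.six provides a `resolve1` helper in `pdfminer.pdftypes` for resolving references; whether applying it at this point is the right fix for this issue is an assumption.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect object reference."""

    def __init__(self, table: dict, objid: int) -> None:
        self.table = table
        self.objid = objid

    def resolve(self):
        # Look the referenced object up in the object table.
        return self.table[self.objid]


def resolve_if_ref(value):
    """Follow references until a concrete value is reached."""
    while isinstance(value, FakeObjRef):
        value = value.resolve()
    return value


def has_predictor(params) -> bool:
    """Membership test that tolerates references and missing params."""
    params = resolve_if_ref(params)
    return isinstance(params, dict) and "Predictor" in params


objects = {7: {"Predictor": 15, "Columns": 640}}
ref = FakeObjRef(objects, 7)

try:
    "Predictor" in ref  # the unresolved reference does not support "in"
except TypeError as exc:
    print("unresolved reference:", exc)

print(has_predictor(ref))
```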

dotrunghieu96 mentioned this issue Nov 8, 2023

FIX #916: move flate decode to higher priority #917

Closed


pietermarsman closed this as completed Jan 16, 2024
