Ignoring Images When Converting from PDF to MD #794

sallahbaksh · 2025-01-23T18:50:19Z

Question

Is there a way to ignore images when converting from PDF to Markdown? If a PDF contains many images, the conversion process becomes very slow, sometimes taking over an hour. Any guidance on optimizing this or skipping images would be greatly appreciated.

PeterStaar-IBM · 2025-01-26T07:31:43Z

Can you give us an example?

sallahbaksh · 2025-01-27T16:35:12Z

I've attached a PDF that takes over an hour to convert to Markdown:
Whitestown-UDO-Adopted-2020-06-12_Amended-November-2023 1.pdf

PeterStaar-IBM · 2025-01-28T06:42:47Z

@sallahbaksh Thanks a lot. Let me investigate, but at first glance it looks like the model gets confused by the page furniture (left and right) and starts to interpret everything as a table, which makes it slow.

I think that with this example, we can make the layout model more robust. Let us work on that!

cau-git · 2025-01-31T11:55:47Z

@sallahbaksh I tried converting the sample document, and it does take a long time, but the output comes out fine in the end. I suspect the slowdown comes from the high frequency of tables: table-structure inference is expensive, so longer runtimes are to be expected. You can verify this by disabling table-structure inference and checking whether the conversion becomes fast. The included images should have no effect on the speed.
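The check suggested above could be sketched roughly as follows. This assumes the thread is about Docling (the labels mention docling-parse) and that table-structure inference is controlled by the `do_table_structure` flag on `PdfPipelineOptions`. The helper names `build_converter`, `time_call`, and `compare_runtimes` are hypothetical and exist only for illustration.

```python
import time
from typing import Callable, Dict


def build_converter(do_table_structure: bool):
    """Build a Docling converter with table-structure inference on or off."""
    # Imported lazily so the timing helpers work without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_table_structure = do_table_structure  # the switch under test
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def compare_runtimes(pdf_path: str, factory=build_converter) -> Dict[str, float]:
    """Convert the same PDF with tables enabled and disabled, timing each run."""
    timings: Dict[str, float] = {}
    for label, flag in (("tables_on", True), ("tables_off", False)):
        converter = factory(flag)  # built outside the timed region
        timings[label] = time_call(lambda: converter.convert(pdf_path))
    return timings


if __name__ == "__main__":
    import sys

    # Usage: python compare_tables.py path/to/document.pdf
    print(compare_runtimes(sys.argv[1]))
```

If the `tables_off` run is much faster, that supports the explanation that table-structure inference, not the images, dominates the runtime. Note that the first conversion may also include one-time model loading, so the two numbers are only a rough comparison.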

sallahbaksh added the question Further information is requested label Jan 23, 2025

PeterStaar-IBM added pdf parsing PDF issue related to docling-parse and removed pdf parsing PDF issue related to docling-parse labels Jan 28, 2025

cau-git removed the pdf parsing PDF issue related to docling-parse label Jan 29, 2025

vagenas added enhancement New feature or request and removed enhancement New feature or request labels Jan 30, 2025

cau-git added the layout label Jan 31, 2025

cau-git self-assigned this Jan 31, 2025
