Ignoring Images When Converting from PDF to MD #794

Open
sallahbaksh opened this issue Jan 23, 2025 · 4 comments
Labels: layout, question (further information is requested)

Comments

@sallahbaksh

Question

Is there a way to ignore images when converting from PDF to Markdown? If a PDF contains many images, the conversion process becomes very slow, sometimes taking over an hour. Any guidance on optimizing this or skipping images would be greatly appreciated.

@sallahbaksh sallahbaksh added the question Further information is requested label Jan 23, 2025
@PeterStaar-IBM
Contributor

Can you give us an example?

@sallahbaksh
Author

sallahbaksh commented Jan 27, 2025

I've attached a PDF that takes over an hour to convert to Markdown:
Whitestown-UDO-Adopted-2020-06-12_Amended-November-2023 1.pdf

@PeterStaar-IBM
Contributor

@sallahbaksh Thanks a lot, let me do some investigation. At first glance, it looks like the model gets confused by the page furniture (left and right) and starts to interpret everything as a table, which makes it slow.

I think that with this example, we can robustify the layout model. Let us work on that!

@PeterStaar-IBM PeterStaar-IBM added pdf parsing PDF issue related to docling-parse and removed pdf parsing PDF issue related to docling-parse labels Jan 28, 2025
@cau-git cau-git removed the pdf parsing PDF issue related to docling-parse label Jan 29, 2025
@vagenas vagenas added enhancement New feature or request and removed enhancement New feature or request labels Jan 30, 2025
@cau-git cau-git added the layout label Jan 31, 2025
@cau-git cau-git self-assigned this Jan 31, 2025
@cau-git
Contributor

cau-git commented Jan 31, 2025

@sallahbaksh I tried converting the sample document, and it does take a long time. However, it comes out fine in the end. I suspect the slowdown is caused by the high frequency of tables: table-structure inference is expensive, so longer runtimes are to be expected. You can check whether you get faster results by disabling table-structure recognition. The included images should have no effect on the speed.
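
One way to run the check suggested above is to convert the same PDF twice, once with table-structure inference enabled and once with it disabled, and compare the wall-clock times. The following is a minimal sketch, not the maintainers' own script. It assumes a docling version in which `PdfPipelineOptions` exposes `do_table_structure`; the helper names `timed`, `build_converter`, and `compare` are hypothetical and introduced only for illustration.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def build_converter(do_table_structure: bool):
    """Build a docling converter with table-structure inference on or off."""
    # Imported lazily so the timing helper above works without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    # Assumption: this flag controls the table-structure step in your version.
    options.do_table_structure = do_table_structure
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


def compare(pdf_path: str) -> None:
    """Convert the same PDF with and without table structure and print timings."""
    for flag in (True, False):
        converter = build_converter(do_table_structure=flag)
        result, seconds = timed(lambda: converter.convert(pdf_path))
        markdown = result.document.export_to_markdown()
        print(f"do_table_structure={flag}: {seconds:.1f}s, {len(markdown)} chars")
```

Calling `compare("your-document.pdf")` with your own file prints one line per setting. The first conversion may also include model loading, so it can be worth running the comparison twice before drawing conclusions. Note that with the flag disabled, tables are no longer reconstructed as structured Markdown tables, so this is a diagnostic rather than a free speed-up.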
