Improve backend resolution logic #802

Open
vagenas opened this issue Jan 24, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@vagenas
Contributor

vagenas commented Jan 24, 2025

Requested feature

Document conversion currently contains logic for "guessing" (i.e. resolving) the backend to use for a given input (ref).

This logic has some limitations. For example, when working with streams, it relies on the first 8KB to detect the backend to use, which may or may not be enough for a correct detection (e.g. the deciding information could appear only at the end of a 10KB stream).

Consider ways to remove these limitations.

One possible high-level approach to examine could be to:

  • remove the current layer of "guessing" a backend a priori and then committing to that guess, and
  • instead, keep for each format (e.g. XML) a list of backends to try one after another until one parses successfully (there can be a default list that the user can override).
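
To make the proposal above concrete, here is a minimal sketch of a "try backends in order" resolver. Everything in it is hypothetical and for illustration only: `BackendError`, `resolve_by_trying`, the two toy backends and the `DEFAULT_BACKENDS` table are not existing names in the project, and a real backend would parse the document rather than search for a marker.

```python
from typing import Callable, Dict, List, Sequence


class BackendError(Exception):
    """Raised by a (hypothetical) backend when it cannot parse the input."""


# A backend is modelled as a plain callable: raw bytes in, parse result out.
Backend = Callable[[bytes], str]


def resolve_by_trying(data: bytes, candidates: Sequence[Backend]) -> str:
    """Try each candidate in order and return the first successful parse."""
    errors: List[str] = []
    for backend in candidates:
        try:
            return backend(data)
        except BackendError as exc:
            # Remember why this backend declined, then move on to the next one.
            errors.append(f"{backend.__name__}: {exc}")
    raise BackendError("no backend could parse the input: " + "; ".join(errors))


def patent_xml_backend(data: bytes) -> str:
    """Toy specialised backend: accepts only inputs containing a marker element."""
    if b"<us-patent" not in data:
        raise BackendError("not a patent document")
    return "parsed as patent XML"


def generic_xml_backend(data: bytes) -> str:
    """Toy fallback backend: accepts anything that looks like XML."""
    if not data.lstrip().startswith(b"<"):
        raise BackendError("not XML")
    return "parsed as generic XML"


# Default order per format; a user-supplied mapping could replace it.
DEFAULT_BACKENDS: Dict[str, List[Backend]] = {
    "xml": [patent_xml_backend, generic_xml_backend],
}
```

Because each backend sees the whole input, a marker that appears after the first 8KB is still found. The trade-off is that a failed attempt costs a full parse, so the order of the list matters.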
@vagenas vagenas added the enhancement New feature or request label Jan 24, 2025
@vagenas vagenas changed the title from "Improve backend resolution" to "Improve backend resolution logic" Jan 24, 2025
@vagenas vagenas assigned vagenas and unassigned vagenas Jan 27, 2025
@dolfim-ibm
Contributor

dolfim-ibm commented Jan 29, 2025

It turns out that the filetype library also loads only the first 8K bytes (ref), so this also happens with file inputs.

@dolfim-ibm dolfim-ibm marked this as a duplicate of #542 Jan 29, 2025
@dolfim-ibm
Contributor

As discovered in #542, some MS Office XML archives have the meta file [Content_Types].xml at the end, so it is not captured by the 8K-byte signature check.
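
The effect can be reproduced with the standard library alone. The sketch below builds a synthetic archive (the member names and sizes are made up for illustration) in which [Content_Types].xml is written last, then checks what an 8K prefix can and cannot see. `SNIFF_BYTES` is an assumed constant standing in for the prefix size mentioned above.

```python
import io
import zipfile

SNIFF_BYTES = 8192  # assumed prefix size used by signature-based detection

# Build an archive whose first member is larger than the sniffed prefix,
# with [Content_Types].xml written as the last member.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("ppt/media/image1.bin", b"\0" * 10_000)
    zf.writestr("[Content_Types].xml", "<Types/>")

raw = buffer.getvalue()
head = raw[:SNIFF_BYTES]

# The zip local-file-header signature is visible in the prefix...
starts_like_zip = head.startswith(b"PK\x03\x04")
# ...but the name of the content-types part is not, because its local header
# and the central directory both sit after the 10,000-byte first member.
name_in_head = b"[Content_Types].xml" in head
name_in_file = b"[Content_Types].xml" in raw

print(starts_like_zip, name_in_head, name_in_file)  # True False True
```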

One way of improving the logic could be:

  1. Detect if the file is a zip archive (here filetype should work)
  2. List the files it contains and check whether [Content_Types].xml is present
  3. If so, read it and infer the proper file type from it. Since zip archives allow random access, this could be more efficient than reading the whole file.
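
The three steps above can be sketched with the standard library as follows. This is an illustration under assumptions, not the project's implementation: `detect_ooxml_format` and `MAIN_PART_TYPES` are hypothetical names, step 1 uses `zipfile.is_zipfile` in place of filetype to keep the example dependency-free, and only the three plain main-part content types are mapped (macro-enabled and template variants are not covered).

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import BinaryIO, Optional

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Main-part content types mapped to illustrative format labels.
MAIN_PART_TYPES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml": "docx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml": "pptx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml": "xlsx",
}


def detect_ooxml_format(stream: BinaryIO) -> Optional[str]:
    """Return 'docx', 'pptx' or 'xlsx', or None if no known main part is declared."""
    # Step 1: is it a zip archive at all? (needs a seekable stream)
    if not zipfile.is_zipfile(stream):
        return None
    stream.seek(0)
    with zipfile.ZipFile(stream) as zf:
        # Step 2: is the content-types part listed in the central directory?
        if "[Content_Types].xml" not in zf.namelist():
            return None
        # Step 3: read only that member, wherever it sits in the archive.
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    for override in root.iter(f"{CT_NS}Override"):
        fmt = MAIN_PART_TYPES.get(override.get("ContentType", ""))
        if fmt is not None:
            return fmt
    return None
```

The member list comes from the central directory at the end of the archive, so the lookup does not depend on where [Content_Types].xml is stored. It does require a seekable input, so a non-seekable stream would have to be buffered first.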

@cau-git
Contributor

cau-git commented Jan 31, 2025

Another sample of a word document not detected as such is seen in issue #476.

@cau-git cau-git marked this as a duplicate of #476 Jan 31, 2025
@dward4

dward4 commented Feb 3, 2025

I'm seeing this issue for pptx files where [Content_Types].xml is present at the top. For example, I ran zipinfo on this slide deck, and the output shows that [Content_Types].xml does indeed sit at the top as expected; I've truncated the rest of the zipinfo output to keep this post short. Below that is the [Content_Types].xml for the file, and I've also included the [Content_Types].xml of a similar pptx file that does process properly. Happy to help test any fixes.

Archive:  name_scrubbed.pptx
Zip file size: 4291099 bytes, number of entries: 115
-rw----     1.0 fat     8628 b- defS 80-Jan-01 00:00 [Content_Types].xml
...
115 files, 4964777 bytes uncompressed, 4274329 bytes compressed:  13.9%

And the corresponding [Content_Types].xml:

<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="emf" ContentType="image/x-emf"/>
  <Default Extension="jpeg" ContentType="image/jpeg"/>
  <Default Extension="jpg" ContentType="image/jpeg"/>
  <Default Extension="png" ContentType="image/png"/>
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/ppt/presentation.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml"/>
  <Override PartName="/ppt/slideMasters/slideMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideMaster+xml"/>
  <Override PartName="/ppt/slides/slide1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/>
  <Override PartName="/ppt/slides/slide2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/>
  <Override PartName="/ppt/slides/slide3.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/>
  <Override PartName="/ppt/notesMasters/notesMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.notesMaster+xml"/>
  <Override PartName="/ppt/handoutMasters/handoutMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.handoutMaster+xml"/>
  <Override PartName="/ppt/presProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presProps+xml"/>
  <Override PartName="/ppt/viewProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.viewProps+xml"/>
  <Override PartName="/ppt/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
  <Override PartName="/ppt/tableStyles.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.tableStyles+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout3.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout4.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout5.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout6.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout7.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout8.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout9.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout10.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout11.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout12.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout13.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout14.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout15.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout16.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout17.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout18.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout19.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout20.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout21.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout22.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout23.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout24.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout25.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout26.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout27.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout28.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout29.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout30.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout31.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout32.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout33.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout34.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout35.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout36.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout37.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout38.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout39.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout40.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/theme/theme2.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
  <Override PartName="/ppt/theme/theme3.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
  <Override PartName="/ppt/changesInfos/changesInfo1.xml" ContentType="application/vnd.ms-powerpoint.changesinfo+xml"/>
  <Override PartName="/ppt/revisionInfo.xml" ContentType="application/vnd.ms-powerpoint.revisioninfo+xml"/>
  <Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
  <Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
</Types>

Finally, in case it is helpful, here is the [Content_Types].xml of a similar PowerPoint file that does process properly:

<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="emf" ContentType="image/x-emf"/>
  <Default Extension="fntdata" ContentType="application/x-fontdata"/>
  <Default Extension="jpeg" ContentType="image/jpeg"/>
  <Default Extension="jpg" ContentType="image/jpeg"/>
  <Default Extension="png" ContentType="image/png"/>
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/ppt/presentation.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml"/>
  <Override PartName="/ppt/slideMasters/slideMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideMaster+xml"/>
  <Override PartName="/ppt/slideMasters/slideMaster2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideMaster+xml"/>
  <Override PartName="/ppt/slides/slide1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/>
  <Override PartName="/ppt/slides/slide2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/>
  <Override PartName="/ppt/notesMasters/notesMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.notesMaster+xml"/>
  <Override PartName="/ppt/handoutMasters/handoutMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.handoutMaster+xml"/>
  <Override PartName="/ppt/presProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presProps+xml"/>
  <Override PartName="/ppt/viewProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.viewProps+xml"/>
  <Override PartName="/ppt/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
  <Override PartName="/ppt/tableStyles.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.tableStyles+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout3.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout4.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout5.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout6.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout7.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout8.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout9.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout10.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout11.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout12.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout13.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout14.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout15.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout16.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout17.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout18.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout19.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout20.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout21.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout22.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout23.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout24.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout25.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout26.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout27.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout28.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout29.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout30.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout31.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout32.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout33.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout34.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout35.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout36.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout37.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout38.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout39.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout40.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout41.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/slideLayouts/slideLayout42.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/>
  <Override PartName="/ppt/theme/theme2.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
  <Override PartName="/ppt/theme/theme3.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
  <Override PartName="/ppt/theme/theme4.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
  <Override PartName="/ppt/notesSlides/notesSlide1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.notesSlide+xml"/>
  <Override PartName="/ppt/authors.xml" ContentType="application/vnd.ms-powerpoint.authors+xml"/>
  <Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
  <Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
  <Override PartName="/docProps/custom.xml" ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/>
</Types>
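
Two listings of this size are hard to compare by eye, so a small helper may be useful. The sketch below extracts the declared content types from two [Content_Types].xml strings and reports those that appear in only one of them. `declared_types` and `content_types_only_in` are hypothetical helper names introduced for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Set, Tuple

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"


def declared_types(xml_text: str) -> Set[Tuple[str, str]]:
    """Collect (extension pattern or part name, content type) pairs."""
    root = ET.fromstring(xml_text)
    pairs: Set[Tuple[str, str]] = set()
    for el in root:
        if el.tag == f"{CT_NS}Default":
            # Defaults apply to every part with the given file extension.
            pairs.add(("*." + el.get("Extension", ""), el.get("ContentType", "")))
        elif el.tag == f"{CT_NS}Override":
            # Overrides apply to one specific part.
            pairs.add((el.get("PartName", ""), el.get("ContentType", "")))
    return pairs


def content_types_only_in(a: str, b: str) -> Set[str]:
    """Content types declared in `a` but not in `b`, ignoring the part names."""
    types_a = {content_type for _, content_type in declared_types(a)}
    types_b = {content_type for _, content_type in declared_types(b)}
    return types_a - types_b
```

Comparing the two listings above by hand along the same lines, the failing file declares `application/vnd.ms-powerpoint.changesinfo+xml` and `application/vnd.ms-powerpoint.revisioninfo+xml`, which the working file does not. The working file in turn declares `application/x-fontdata`, the `notesSlide` type, `application/vnd.ms-powerpoint.authors+xml` and the custom-properties type. Whether any of these differences is related to the detection failure is not established here.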
