msexcel_backend.py doesn’t parse complex Excel tables properly. #834

Open
rafaelsanchezsouza opened this issue Jan 29, 2025 · 0 comments
Labels
bug Something isn't working enhancement New feature or request xlsx issue related to xlsx backend

Bug

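When converting an Excel document that contains complex tables, Docling fails to extract the tables correctly: in the output below, formula cells appear as raw formula text such as =(A11+1), and some cell text is repeated several times on one line.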

Steps to reproduce

from docling.document_converter import DocumentConverter

source = "./excel-tests.xlsx"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # prints the text shown under "Output"

excel-tests.xlsx

Output

HIGH VOLTAGE SWITCHBOARD
DATA SHEET
MODEL-000
2016
Page 1 of 10
Package no.:
13156456
Doc. no.:
144564
Rev.
A
Power system
1
=(A11+1)
=(A12+1)
=(A13+1)
=(A14+1)
=(A15+1)
Construction
7
=(A18+1)
=(A19+1)
Environmental conditions
=(A20+1)
=(A22+1)
=(A23+1)
=(A24+1)
Arc test
=(A25+1)
Notes
1
2
Rated system voltage
Rated system frequency
No. of phases
System earthing
Earth fault current
Control voltage supply
kV : 130 (131 Um, 132 AC, 133 BIL)
Hz : 134
: 3
: Solidly Earthed
: 135 kA
: 2 x 136V AC UPS 1 x 137V AC normal
A : 135 kA
: 2 x 136V AC UPS 1 x 137V AC normal
Metal-enclosed partition
VT for cable discharging
Voltage and Current measurement
No
Low Power Instrument Transformers
Hazardous area classification
Ambient temp.
Location
Humidity
: Non hazardous
: Min. -5, max. +40
: Indoor
: 100
Converted to 110VDC Converted to 110VDC Converted to 110VDC Converted to 110VDC Converted to 110VDC
Arc test (type test) Arc test (type test) Arc test (type test) Arc test (type test) Arc test (type test)
None None None None None
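
The `=(A11+1)` entries above are formula strings, not the row numbers the sheet displays. The following is a minimal sketch, not Docling code, of how a chain of such numbering formulas could be resolved to values. It assumes every formula has exactly the form `=(REF+N)`. The `cells` dictionary and the `resolve_cell` helper are hypothetical and only modelled on the numbering column in the output.

```python
import re

# Matches only formulas of the form =(A11+1), as seen in the output above.
# Anything else (SUM, ranges, other sheets) is deliberately not handled.
_INCREMENT = re.compile(r"^=\(([A-Z]+[0-9]+)\+([0-9]+)\)$")


def resolve_cell(cells, ref, _seen=None):
    """Return the numeric value of cells[ref], following =(REF+N) chains."""
    seen = _seen if _seen is not None else set()
    if ref in seen:
        raise ValueError(f"circular reference at {ref}")
    seen.add(ref)

    value = cells.get(ref)
    if isinstance(value, str):
        match = _INCREMENT.match(value.strip())
        if match:
            target, step = match.group(1), int(match.group(2))
            return resolve_cell(cells, target, seen) + step
    if isinstance(value, (int, float)):
        return value
    raise ValueError(f"cannot resolve {ref}: {value!r}")


# Hypothetical cell contents; the real workbook layout is an assumption.
cells = {"A11": 1, "A12": "=(A11+1)", "A13": "=(A12+1)", "A14": "=(A13+1)"}
resolved = {ref: resolve_cell(cells, ref) for ref in cells}
print(resolved)  # {'A11': 1, 'A12': 2, 'A13': 3, 'A14': 4}
```

In practice it is usually simpler to read the values Excel cached at save time than to evaluate formulas by hand. openpyxl, for example, exposes them through `load_workbook(path, data_only=True)`. Whether the xlsx backend can or should do this is not established in this report.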

Docling version

Docling version: 2.17.0
Docling Core version: 2.16.0
Docling IBM Models version: 3.3.0
Docling Parse version: 3.1.2
Python: cpython-310 (3.10.7)
Platform: Windows-10-10.0.19045-SP0

Python version

Python 3.10.7

Final Considerations

I understand that the table is complex, so I would like to know what the requirements are for an Excel document to work with Docling. Digging into the code, I noticed this:

Hope it helps,

Let me know if you need more information.

Have a nice day!

@rafaelsanchezsouza rafaelsanchezsouza added the bug Something isn't working label Jan 29, 2025
@dolfim-ibm dolfim-ibm added the xlsx issue related to xlsx backend label Jan 30, 2025
@vagenas vagenas added the enhancement New feature or request label Jan 30, 2025