Incorrect Bounding Boxes for PDFs #3867

Open
charlottecrnj opened this issue Jan 16, 2025 · 0 comments
Labels
bug Something isn't working

Describe the bug
The bounding boxes returned by the HI_RES strategy are wrong for PDFs.

To Reproduce

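import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

# Client setup was omitted from the original snippet; the environment
# variable name here is an assumption, adjust it to your configuration.
client = UnstructuredClient(api_key_auth=os.environ["UNSTRUCTURED_API_KEY"])

filename = "example.pdf"
with open(filename, "rb") as f:
    data = f.read()

# Request HI_RES partitioning with coordinates for a German-language PDF.
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename,
        ),
        strategy=shared.Strategy.HI_RES,
        coordinates=True,
        languages=["de"],
    ),
)

try:
    res = client.general.partition(request=req)
    # The first element is enough to see the returned coordinates metadata.
    print(res.elements[0])
except Exception as e:
    print(e)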

Expected behavior
I would expect the bounding boxes to be correctly placed around each of the elements returned by the unstructured API.
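
To make "correctly placed" checkable, here is a minimal sketch of how a box can be read from a returned element. It assumes each element is a dict whose `metadata.coordinates.points` holds the corner points as `[x, y]` pairs, which is how the API output looks on my side; `element_bbox` is a hypothetical helper, not part of the library.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]


def element_bbox(element: dict) -> Optional[BBox]:
    """Return (x0, y0, x1, y1) for an element, or None if it has no coordinates.

    Assumes the element carries metadata["coordinates"]["points"] as a list
    of [x, y] corner points. min/max is used so the corner order is irrelevant.
    """
    coords = (element.get("metadata") or {}).get("coordinates")
    if not coords or not coords.get("points"):
        return None
    xs = [point[0] for point in coords["points"]]
    ys = [point[1] for point in coords["points"]]
    return (min(xs), min(ys), max(xs), max(ys))
```

Elements from strategies that return no coordinates yield `None` here, so they can be filtered out before drawing.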

Screenshots of Actual (Wrong) Behavior
Image

Additional context
This issue was already discussed in a previous issue (#3100). Back then, the default strategy would still return bounding boxes. This does not seem to be the case anymore: all strategies except hi_res return no coordinates (and hence no bounding boxes). So is there currently no way to retrieve proper bounding boxes for PDFs?

Does anyone know a way to retrieve correct bounding boxes?
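
One thing worth ruling out, and this is only a guess rather than a confirmed cause: the coordinates metadata also reports a `layout_width` and `layout_height`, so the points may be in the pixel space of the rendered page image rather than in PDF points. A rough sketch of rescaling a box to the page size before drawing it is below; `scale_bbox` and its parameters are hypothetical names for illustration.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def scale_bbox(
    bbox: BBox,
    layout_size: Tuple[float, float],
    page_size: Tuple[float, float],
    flip_y: bool = False,
) -> BBox:
    """Rescale (x0, y0, x1, y1) from layout space to page space.

    layout_size is (layout_width, layout_height) from the coordinates
    metadata; page_size is the target (width, height), e.g. in PDF points.
    Set flip_y=True if the target has its origin at the bottom-left.
    """
    x0, y0, x1, y1 = bbox
    sx = page_size[0] / layout_size[0]
    sy = page_size[1] / layout_size[1]
    x0, x1 = x0 * sx, x1 * sx
    y0, y1 = y0 * sy, y1 * sy
    if flip_y:
        # Mirror vertically and swap so that y0 <= y1 still holds.
        y0, y1 = page_size[1] - y1, page_size[1] - y0
    return (x0, y0, x1, y1)
```

For a US Letter page this would be called as `scale_bbox(box, (layout_width, layout_height), (612, 792))`. If the boxes are still off after rescaling, the mismatch is not just a unit problem.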
