Wrong text bounding boxes from the pdf parser #813

petrgronat · 2025-01-27T09:17:39Z

Bug

For some PDFs (see attached samples), the BaseOcrModel.get_ocr_rects() method returns wrong bounding boxes. When I look at a bounding box, it looks like a vertical strip crossing the text in the middle of the line (see the attached screenshot from high_res_image.show() inside EasyOcrModel.__call__()).

As a result, the markdown produced for exo_pg1.pdf is a pile of garbage. On the other hand, when I created a very similar slide in Google Docs and exported it to PDF (see exo_synth.pdf), the document was parsed normally.
...

exo_pg1.pdf
exo_synth.pdf
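
To make the "vertical strip" symptom easier to check without opening the image, here is a minimal sketch that flags boxes much taller than they are wide. The `Rect` type, the `is_vertical_strip` helper, and the 3:1 threshold are all hypothetical and only for illustration; they are not part of docling, and the rectangles returned by get_ocr_rects() would need to be converted to this shape first.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned box in page coordinates (hypothetical stand-in, not a docling type)."""
    left: float
    top: float
    right: float
    bottom: float


def is_vertical_strip(rect: Rect, min_ratio: float = 3.0) -> bool:
    """Return True when the box is at least `min_ratio` times taller than it is wide."""
    width = abs(rect.right - rect.left)
    height = abs(rect.bottom - rect.top)
    if height == 0:
        # A box with no height cannot be a vertical strip.
        return False
    return height >= min_ratio * width


def suspicious_rects(rects: list[Rect], min_ratio: float = 3.0) -> list[Rect]:
    """Keep only the boxes that look like vertical strips."""
    return [r for r in rects if is_vertical_strip(r, min_ratio)]
```

A normal box around a line of horizontal text is far wider than it is tall, so a page where most boxes are flagged would match the behaviour described above. The threshold is a guess and may need tuning for vertical scripts or rotated pages.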

Steps to reproduce

from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions, OcrEngine
from docling.document_converter import DocumentConverter, PdfFormatOption, InputFormat
from pathlib import Path
filename_pdf = Path("~/exo_pg1.pdf").expanduser()  # Path does not expand "~" on its own

# 1. Configure your OCR options
ocr_options = EasyOcrOptions(
    lang=["en"],            # list of language codes (EasyOCR or Tesseract)
)

# 2. Configure your PDF pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = ocr_options  # attach your OCR config here

# Additional pipeline options (table structure, scaling, etc.)
pipeline_options.do_table_structure = True
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
# ... etc.

# 3. Pass these pipeline options into the PdfFormatOption
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# 4. Run the conversion
conv_res = doc_converter.convert(filename_pdf)
doc = conv_res.document
md = doc.export_to_markdown()
print(md)

Docling version

docling==2.15.1
docling-core==2.15.1
docling-ibm-models==3.2.1
docling-parse==3.1.1

Python version

Python 3.12.8


PeterStaar-IBM · 2025-01-28T06:44:51Z

@petrgronat Thanks, we will have a look at it asap!

cau-git · 2025-01-29T12:42:12Z

Probably related issue: DS4SD/docling-parse#81

PeterStaar-IBM · 2025-01-30T10:43:10Z

@petrgronat Will be resolved after this PR is included (DS4SD/docling-parse#90).

petrgronat added the bug Something isn't working label Jan 27, 2025

PeterStaar-IBM self-assigned this Jan 28, 2025

PeterStaar-IBM added the pdf parsing PDF issue related to docling-parse label Jan 28, 2025

vagenas added enhancement New feature or request and removed enhancement New feature or request labels Jan 30, 2025

PeterStaar-IBM assigned cau-git Jan 30, 2025
