[FEA]: Deduplicate images that match any tables or charts #389

edknv · 2025-01-30T05:46:44Z

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Significant improvement

Please provide a clear description of problem this feature solves

When ingesting PDFs using nv-ingest, we often see duplicate extractions for the same visual element. For example, a chart might be detected both as a “chart” (via the YOLOX-based detection pipeline) and as an “image” (extracted directly from the PDF’s raw data). This leads to repeated entries referencing effectively the same content. In other cases, a large single figure is decomposed into smaller images, and then also recognized via the chart/table detection, further duplicating results.

Users have reported that in research papers and technical PDFs, these duplicates clutter downstream applications (like RAG or multi-modal analytics) with multiple references to the same entity, forcing a manual deduplication step.

Describe the feature, and optionally a solution or implementation and any alternatives

The dedup stage in the pipeline may be a good place to implement post-processing that deduplicates or merges these overlapping detections. One option is to check for overlapping bounding boxes: if a bounding box from the “raw PDF extraction” phase substantially overlaps or exactly covers the bounding box from a YOLOX chart/table detection, treat them as one entity. We would likely want to keep the tables and charts and drop the images when they are duplicates.
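A minimal sketch of the bounding-box check described above, to make the idea concrete. Everything here is hypothetical and not part of nv-ingest: the `Element` record, the `drop_duplicate_images` helper, and the `0.9` threshold are illustrative assumptions. The sketch also assumes that image and chart/table boxes are expressed in the same page coordinate space as `(x0, y0, x1, y1)`.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in shared page coordinates.
BBox = Tuple[float, float, float, float]


@dataclass
class Element:
    """Hypothetical extracted element; not an nv-ingest type."""
    kind: str  # "image", "chart", or "table"
    page: int
    bbox: BBox


def area(box: BBox) -> float:
    """Area of a box, clamped to zero for degenerate boxes."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: BBox, b: BBox) -> float:
    """Area of the overlap between two boxes (zero if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def covered_fraction(inner: BBox, outer: BBox) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    inner_area = area(inner)
    if inner_area == 0.0:
        return 0.0
    return intersection_area(inner, outer) / inner_area


def drop_duplicate_images(
    elements: List[Element], threshold: float = 0.9
) -> List[Element]:
    """Keep charts and tables; drop images mostly covered by one of them.

    The threshold is an assumed value and would need tuning on real PDFs.
    """
    structured = [e for e in elements if e.kind in ("chart", "table")]
    kept = []
    for element in elements:
        is_duplicate = element.kind == "image" and any(
            s.page == element.page
            and covered_fraction(element.bbox, s.bbox) >= threshold
            for s in structured
        )
        if not is_duplicate:
            kept.append(element)
    return kept
```

Measuring how much of the image is covered by the chart/table box, rather than symmetric intersection-over-union, is a design choice in this sketch: it also catches the case where a large figure has been decomposed into smaller images that each sit inside a single detected chart or table.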

Additional context

No response

edknv added the “feature request” label on Jan 30, 2025
