[FEA]: Deduplicate images that match any tables or charts #389

Open
edknv opened this issue Jan 30, 2025 · 0 comments
Labels
feature request New feature or request

Comments

@edknv
Collaborator

edknv commented Jan 30, 2025

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Significant improvement

Please provide a clear description of problem this feature solves

When ingesting PDFs using nv-ingest, we often see duplicate extractions for the same visual element. For example, a chart might be detected both as a “chart” (via the YOLOX-based detection pipeline) and as an “image” (extracted directly from the PDF’s raw data). This leads to repeated entries referencing effectively the same content. In other cases, a large single figure is decomposed into smaller images, and then also recognized via the chart/table detection, further duplicating results.

Users have reported that in research papers and technical PDFs, these duplicates clutter downstream applications (like RAG or multi-modal analytics) with multiple references to the same entity, forcing a manual deduplication step.

Describe the feature, and optionally a solution or implementation and any alternatives

The dedup stage in the pipeline may be a good place to implement post-processing that deduplicates or merges these overlapping detections. One approach is to check for overlapping bounding boxes: if a bounding box from the “raw PDF extraction” phase substantially overlaps or exactly covers the bounding box of a YOLOX chart/table detection, treat the two as one entity. We would likely want to keep the tables and charts and drop the images when they are duplicates.
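
To make the bounding-box idea concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration and not the nv-ingest API: the `Detection` record, the `(x0, y0, x1, y1)` box layout, the `drop_duplicate_images` helper, and the `0.8` threshold are all hypothetical. It treats an image as a duplicate when most of its area lies inside a chart or table box on the same page.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in page coordinates.
BBox = Tuple[float, float, float, float]


@dataclass
class Detection:
    """Hypothetical record for one extracted element (not an nv-ingest type)."""
    kind: str  # "image", "chart", or "table"
    page: int
    bbox: BBox


def area(box: BBox) -> float:
    """Area of a box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: BBox, b: BBox) -> float:
    """Area of the overlap between two boxes, or 0.0 if they are disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def covered_fraction(image_box: BBox, structured_box: BBox) -> float:
    """Fraction of the image box that lies inside the chart/table box."""
    image_area = area(image_box)
    if image_area == 0.0:
        return 0.0
    return intersection_area(image_box, structured_box) / image_area


def drop_duplicate_images(
    detections: List[Detection], threshold: float = 0.8
) -> List[Detection]:
    """Keep all charts and tables; drop images mostly covered by one of them.

    The threshold is an assumed value and would need tuning on real PDFs.
    """
    structured = [d for d in detections if d.kind in ("chart", "table")]
    kept = []
    for det in detections:
        is_duplicate = det.kind == "image" and any(
            s.page == det.page
            and covered_fraction(det.bbox, s.bbox) >= threshold
            for s in structured
        )
        if not is_duplicate:
            kept.append(det)
    return kept


if __name__ == "__main__":
    sample = [
        Detection("chart", 1, (0, 0, 100, 100)),
        Detection("image", 1, (0, 0, 100, 100)),      # same region as the chart
        Detection("image", 1, (200, 200, 300, 300)),  # unrelated image
    ]
    for det in drop_duplicate_images(sample):
        print(det.kind, det.page, det.bbox)
```

Measuring the covered fraction of the image (rather than a symmetric IoU) also catches the case where a single figure is split into several small images that each sit inside one detected chart. The trade-off is that an image much larger than the chart or table it contains is kept, which may or may not be the desired behavior.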

Additional context

No response

@edknv edknv added the feature request New feature or request label Jan 30, 2025