Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Significant improvement
Please provide a clear description of problem this feature solves
When ingesting PDFs using nv-ingest, we often see duplicate extractions for the same visual element. For example, a chart might be detected both as a “chart” (via the YOLOX-based detection pipeline) and as an “image” (extracted directly from the PDF’s raw data). This leads to repeated entries referencing effectively the same content. In other cases, a large single figure is decomposed into smaller images, and then also recognized via the chart/table detection, further duplicating results.
Users have reported that in research papers and technical PDFs, these duplicates clutter downstream applications (like RAG or multi-modal analytics) with multiple references to the same entity, forcing a manual deduplication step.
Describe the feature, and optionally a solution or implementation and any alternatives
The dedup stage in the pipeline may be a good place to implement post-processing that deduplicates or merges these overlapping detections. One option is to check for overlapping bounding boxes: if a bounding box from the “raw PDF extraction” phase substantially overlaps or exactly covers the bounding box from a YOLOX chart/table detection, treat the two as one entity. When they are duplicates, we would likely want to keep the tables and charts and drop the images.
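To make the bounding-box idea concrete, here is a minimal sketch of the check. It does not use the nv-ingest API: the `Detection` class, the `drop_duplicate_images` function, and the `0.9` threshold are all hypothetical names and values chosen for illustration. It also assumes that both extraction paths report boxes as `(x0, y0, x1, y1)` in the same per-page coordinate space, which would need to be verified against the real metadata.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout: (x0, y0, x1, y1) in a shared per-page coordinate space.
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    """Hypothetical record for one extracted element (not an nv-ingest type)."""
    page: int
    kind: str  # "image", "chart", or "table"
    bbox: Box


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes, or zero if they are disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def covered_fraction(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    inner_area = area(inner)
    if inner_area == 0.0:
        return 0.0
    return intersection_area(inner, outer) / inner_area


def drop_duplicate_images(
    detections: List[Detection], threshold: float = 0.9
) -> List[Detection]:
    """Keep charts and tables; drop images mostly covered by one on the same page."""
    structured = [d for d in detections if d.kind in ("chart", "table")]
    kept = []
    for det in detections:
        is_duplicate = det.kind == "image" and any(
            s.page == det.page
            and covered_fraction(det.bbox, s.bbox) >= threshold
            for s in structured
        )
        if not is_duplicate:
            kept.append(det)
    return kept
```

Measuring the fraction of the image box that falls inside the chart/table box, rather than intersection-over-union, is a deliberate choice in this sketch: it catches both the near-exact duplicate and the case where a large figure was split into smaller image fragments. The trade-off is that an image much larger than the chart it contains (for example a full-page background) is kept. The threshold would need tuning on real documents.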
Additional context
No response