This is a benchmark of Chunkr AI on RD-TableBench (https://huggingface.co/datasets/reducto/rd-tablebench), run via our new API. The Chunkr results in the original repo and blog are a few months out of date, so we are publishing our updated results here. On the new implementation we scored 0.81 (on 500 randomly sampled tables from the rd-tablebench dataset, last run on Feb 7th 2025), compared to <0.65 in the original benchmark run.
We ran the dataset with the following configuration on our API:
```python
# Imports assume the chunkr-ai Python SDK's models module.
from chunkr_ai.models import (Configuration, GenerationConfig,
                              GenerationStrategy, OcrStrategy,
                              SegmentationStrategy, SegmentProcessing)

config = Configuration(
    ocr_strategy=OcrStrategy.ALL,
    segmentation_strategy=SegmentationStrategy.PAGE,
    segment_processing=SegmentProcessing(page=GenerationConfig(html=GenerationStrategy.LLM)),
    high_resolution=True,
)
```
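For context, this is roughly how a single file is pushed through the API with that configuration. A minimal sketch assuming the `chunkr-ai` Python SDK's `Chunkr` client; the client setup, `upload` helper, and output accessor shown here are assumptions, not part of the original snippet:

```python
from chunkr_ai import Chunkr  # assumed client class from the chunkr-ai SDK

chunkr = Chunkr()  # assumes CHUNKR_API_KEY is set in the environment
task = chunkr.upload("table_page.pdf", config)  # assumed helper; waits for the task to finish
html = task.html()  # assumed accessor for the extracted HTML output
```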
**String matching for table grading has significant limitations:**

- Fails to capture the visual and semantic structure of tables
- Cannot evaluate layout and formatting that conveys meaning
- Relies on brittle heuristics that break on valid variations (such as `edge_case_002.jpg`); see the toy example below
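As a toy illustration of that brittleness (not Reducto's actual grading code), a naive string comparison rejects two semantically equivalent rows when one merges columns:

```python
# Two semantically equivalent rows: the prediction merges the two name
# columns into one cell, which a pure string match treats as a miss.
truth = "<tr><td>First</td><td>Last</td><td>Age</td></tr>"
pred = "<tr><td>First Last</td><td>Age</td></tr>"

def naive_match(a: str, b: str) -> bool:
    # Strip whitespace and compare raw strings, roughly how brittle
    # heuristics behave; any structural variation fails outright.
    return "".join(a.split()) == "".join(b.split())

print(naive_match(truth, pred))  # False, despite identical meaning
```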
**The scoring criteria in `grading.py` (S_ROW_MATCH, G_ROW, etc., in their GitHub repo) may not reflect real-world table quality.** For example, intelligently merged columns that maintain semantic meaning are heavily penalized, even though they may be perfectly valid for downstream LLM tasks. The code only strips newlines, hyphens, and whitespace, but doesn't handle other HTML formatting, such as:

- `<b>` or `<strong>` tags for bold text
- `<i>` or `<em>` tags for italics
- `<sup>` and `<sub>` tags for superscript/subscript

Such differences are not normalized, and their use is not consistent across the dataset either, which skews results (see `edge_case_001.png`). A sketch of the missing normalization follows.
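For illustration, a minimal sketch of the kind of tag normalization a grader could apply before comparing cells (a hypothetical helper, not code from the benchmark repo; it only handles bare tags without attributes):

```python
import re

# Formatting tags that change presentation, not table content.
FORMATTING_TAGS = ("b", "strong", "i", "em", "sup", "sub")

def strip_formatting(html: str) -> str:
    # Remove opening and closing formatting tags so that "<b>Total</b>"
    # and "Total" compare as equal cell contents.
    pattern = r"</?(?:{})>".format("|".join(FORMATTING_TAGS))
    return re.sub(pattern, "", html)

print(strip_formatting("<td><b>Total</b></td>"))  # <td>Total</td>
```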
**Character normalization issues** (a sketch of both fixes follows this list):

- No handling of special characters or their HTML entities (e.g., `&amp;` vs `&`)
- No Unicode normalization (e.g., composed vs decomposed characters)
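Both gaps can be closed with the Python standard library; a minimal sketch of such a normalization step (a hypothetical helper, not code from the benchmark repo):

```python
import html
import unicodedata

def normalize_cell(text: str) -> str:
    # Decode HTML entities so "&amp;" and "&" compare as equal.
    text = html.unescape(text)
    # Normalize to NFC so composed "é" and decomposed "e" + U+0301 match.
    return unicodedata.normalize("NFC", text)

assert normalize_cell("A&amp;B") == normalize_cell("A&B")
assert normalize_cell("caf\u00e9") == normalize_cell("cafe\u0301")
```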
**The ground truth is also not always representative of the tables.** Despite Gemini-Pro-1.5 (what we use for tables) scoring lower than Reducto, in practice its output is actually better than some of the ground truth samples. I've saved one such example to the `edge_case_001.png` file: the ground truth incorrectly centers the title and does not add bold tags to the first column's rows, so we get penalized for doing this correctly. Another evaluator (who unfortunately used outdated Chunkr results) also independently found discrepancies in the original benchmark (https://www.sergey.fyi/articles/gemini-flash-2, listed in Footnotes [1]). It is likely that a very high score on this benchmark (>85%) correlates with worse real-world performance and indicates overfitting to the eval.