Commit 8950dd9: update website

JamesZhutheThird committed Jan 7, 2025 (parent 8a8e5be)
Showing 4 changed files with 39 additions and 22 deletions.
40 changes: 27 additions & 13 deletions README.md
@@ -4,14 +4,17 @@

![MULTI](./docs/static/images/overview.png)

🌐 [Website](https://OpenDFM.github.io/MULTI-Benchmark/) | 📃 [Paper](https://arxiv.org/abs/2402.03173/) | 🤗 [Dataset](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark) | 📮 [Submit](https://opendfm.github.io/MULTI-Benchmark/static/pages/submit.html)
🌐 [Website](https://OpenDFM.github.io/MULTI-Benchmark/) | 📃 [Paper](https://arxiv.org/abs/2402.03173/) | 🤗 [Dataset](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark) |
🏆 [Leaderboard](https://opendfm.github.io/MULTI-Benchmark/#leaderboard) | 📮 [Submit](https://opendfm.github.io/MULTI-Benchmark/static/pages/submit.html)

[简体中文](./README_zh.md) | English

</div>

## 🔥 News

- **[2025.1.7]** We have updated our [leaderboard](https://opendfm.github.io/MULTI-Benchmark/#leaderboard) with the latest results.
- **[2025.1.2]** We have updated MULTI to v1.3.1.
- **[2024.3.4]** We have released the [evaluation page](https://OpenDFM.github.io/MULTI-Benchmark/static/pages/submit.html).
- **[2024.2.19]** We have released the [HuggingFace Page](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/).
- **[2024.2.6]** We have published our [paper](https://arxiv.org/abs/2402.03173/) on arXiv.
@@ -20,7 +23,11 @@

## 📖 Overview

Rapid progress in multimodal large language models (MLLMs) highlights the need to introduce challenging yet realistic benchmarks to the academic community, while existing benchmarks primarily focus on understanding simple natural images and short context. In this paper, we present ***MULTI***, a cutting-edge benchmark for evaluating MLLMs on understanding complex tables and images, and reasoning with long context. **MULTI** provides multimodal inputs and requires responses that are either precise or open-ended, reflecting real-life examination styles. **MULTI** includes over 18,000 questions and challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis and cross-modality reasoning. We also introduce ***MULTI-Elite***, a carefully selected hard subset of 500 questions, and ***MULTI-Extend***, with more than 4,500 external knowledge context pieces. Our evaluation indicates significant potential for MLLM advancement, with GPT-4V achieving a **63.7%** accuracy rate on **MULTI**, in contrast to other MLLMs scoring between **28.5%** and **55.3%**. **MULTI** serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.
The rapid development of multimodal large language models (MLLMs) raises the question of how they compare to human performance. While existing datasets often feature synthetic or overly simplistic tasks, some models have already surpassed human expert baselines. In this paper, we present **MULTI**, a Chinese multimodal dataset derived from authentic examination questions. Comprising over 18,000 carefully selected and refined questions, **MULTI** evaluates models using real-world examination standards, encompassing image-text comprehension, complex reasoning, and knowledge recall. We also introduce **MULTI-Elite**, a carefully selected hard subset of 500 questions, and **MULTI-Extend**, with more than 4,500 external knowledge context pieces for testing in-context learning capabilities. **MULTI** serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.

## ⏬ Download

@@ -47,7 +54,8 @@ The structure of `./data` should be something like:

## 📝 How to Evaluate

We provide a unified evaluation framework in `eval`. Each file in `eval/models` contains an evaluator dedicated to one M/LLM and implements a `generate_answer` method that receives a question as input and returns its answer.

```shell
cd eval
@@ -57,7 +65,8 @@ python eval.py -l # to list all supported models

### Environment Preparation Before Usage

Each evaluator requires its own environment setup, and a single universal environment may not work for all evaluators. **Just follow the official guide for each model.** If the corresponding model runs well on its own, it should also work within our framework.
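
For example, a per-model environment might look like the following sketch. The environment name, Python version, and requirements file are hypothetical placeholders; use whatever the model's official guide specifies.

```shell
# Hypothetical setup for one evaluator (names and versions are placeholders)
conda create -n qwen-vl-eval python=3.10 -y
conda activate qwen-vl-eval
pip install -r requirements.txt   # the model's own requirements, per its official guide
```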

You just need to install another two packages to run the evaluation code:

@@ -99,14 +108,16 @@ python eval.py \
--model_dir ../models/Qwen-VL-Chat
```

The evaluation script will create a folder named `results` under the root directory, and the results will be saved in `../results/EXPERIMENT_NAME`. During the evaluation, the script saves checkpoints in `../results/EXPERIMENT_NAME/checkpoints`; you can delete them after the evaluation is done. If the evaluation is interrupted, you can continue from the last checkpoint:

```shell
python eval.py \
--checkpoint_dir ../results/EXPERIMENT_NAME
```

Most of the arguments are saved in `../results/EXPERIMENT_NAME/args.json`, so you can continue the evaluation without specifying all the arguments again. Please note that `--api_key` is not saved in `args.json` for security reasons, so you need to specify it again:

```shell
python eval.py \
@@ -118,13 +129,15 @@ For more details of arguments, please use `python eval.py -h`, and refer to `arg

### Add Support for Your Models

It is recommended to read the code of the existing evaluators in `eval/models` before implementing your own.

Create `class YourModelEvaluator` and implement `generate_answer(self, question: dict)` to match the interface expected by `eval.py` and `eval.sh`; this should greatly ease the coding process.

**Do not forget to add a reference to your evaluator in `args.py` so it can be selected conveniently.**
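
As a rough illustration, a new evaluator might look like the minimal sketch below. The constructor arguments and the question fields used here (`question_text`, `image_list`) are placeholders, not the actual schema; check the existing evaluators in `eval/models` for the real interface before relying on it.

```python
# eval/models/your_model.py -- illustrative sketch only; field names are hypothetical
class YourModelEvaluator:
    def __init__(self, model_dir: str):
        # Load your model and tokenizer/processor here, following the model's official guide.
        self.model_dir = model_dir

    def generate_answer(self, question: dict) -> str:
        # `question` describes one benchmark item; the keys below are placeholders --
        # consult the existing evaluators for the actual field names.
        prompt = question.get("question_text", "")
        images = question.get("image_list", [])
        # Call your model here and return its textual answer.
        return f"dummy answer for: {prompt[:50]} ({len(images)} image(s))"

# Remember to reference the new evaluator in args.py so eval.py can select it.
```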

You can execute `model_tester.py` in the `eval` folder to check the correctness of your implementation. Various problems, including implementation errors, small bugs in the code, and even incorrect environment settings, may cause the evaluation to fail. The examples provided in the file cover most kinds of cases presented in our benchmark. Feel free to modify it while debugging your code 😊

```shell
python model_tester.py <args> # args are similar to the default settings above
@@ -159,14 +172,15 @@ You need to first prepare a UTF-8 encoded JSON file with the following format:
}
```

If you evaluate the model with our official code, you can simply zip the prediction file `prediction.json` and the configuration file `args.json` from the experiment results folder `./results/EXPERIMENT_NAME` into a `.zip` archive.
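
For instance, a submission archive could be created like this (the archive name `submission.zip` is just an example):

```shell
cd ./results/EXPERIMENT_NAME
zip submission.zip prediction.json args.json
```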

Then, you can submit your result to our [evaluation page](https://opendfm.github.io/MULTI-Benchmark/static/pages/submit.html).

You are also welcome to open a pull request and contribute to our evaluation code. We will be very grateful for your contribution!

**[Notice]** Thank you for being so interested in the **MULTI** dataset! If you want to add your model to our leaderboard, please fill in [this questionnaire](https://wj.sjtu.edu.cn/q/89UmRAJn). Your information will be kept strictly confidential, so please feel free to fill it out. 🤗

## 📑 Citation

6 changes: 4 additions & 2 deletions README_zh.md
@@ -4,14 +4,16 @@

![MULTI](./docs/static/images/overview.png)

🌐 [Website](https://OpenDFM.github.io/MULTI-Benchmark/) | 📃 [Paper](https://arxiv.org/abs/2402.03173/) | 🤗 [Dataset](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark) | 📮 [Submit](https://opendfm.github.io/MULTI-Benchmark/static/pages/submit.html)
🌐 [Website](https://OpenDFM.github.io/MULTI-Benchmark/) | 📃 [Paper](https://arxiv.org/abs/2402.03173/) | 🤗 [Dataset](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark) | 🏆 [Leaderboard](https://opendfm.github.io/MULTI-Benchmark/#leaderboard) | 📮 [Submit](https://opendfm.github.io/MULTI-Benchmark/static/pages/submit.html)

简体中文 | [English](./README.md)

</div>

## 🔥 News

- **[2025.1.7]** We have updated the [leaderboard](https://opendfm.github.io/MULTI-Benchmark/#leaderboard) with the latest results.
- **[2025.1.2]** We have updated MULTI to v1.3.1.
- **[2024.3.4]** We have released the [evaluation page](https://opendfm.github.io/MULTI-Benchmark/static/pages/submit.html).
- **[2024.2.19]** We have released the [HuggingFace page](https://huggingface.co/datasets/OpenDFM/MULTI-Benchmark/).
- **[2024.2.6]** We have published our [paper](https://arxiv.org/abs/2402.03173/) on arXiv.
@@ -20,7 +22,7 @@

## 📖 Overview

Amid the rapid progress of multimodal large language models (MLLMs), it has become especially important to propose challenging benchmarks that reflect realistic scenarios, while existing benchmarks mainly focus on understanding simple natural images and short text. In this paper, we introduce ***MULTI***, a cutting-edge benchmark for evaluating MLLMs on understanding complex tables and images and reasoning over long text. **MULTI** provides multimodal inputs and requires answers that are either precise or open-ended, reflecting real-life examination styles. **MULTI** includes over 18,000 questions and challenges MLLMs with a variety of tasks, from formula derivation to image detail analysis and cross-modal reasoning. We also introduce ***MULTI-Elite***, a carefully selected hard subset of 500 questions, and ***MULTI-Extend***, which contains more than 4,500 external knowledge context pieces. Our evaluation shows great potential for MLLM progress, with GPT-4V reaching an accuracy of **63.7%** on **MULTI**, while other MLLMs score between **28.5%** and **55.3%**. **MULTI** serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.
Against the backdrop of the rapid development of multimodal large language models (MLLMs), how they compare with human performance has become an important question. Existing datasets often involve synthetic data or overly simple tasks, while some models have already surpassed human expert baselines. This paper introduces **MULTI**, a Chinese multimodal dataset derived from authentic examination questions. **MULTI** contains over 18,000 carefully selected and refined questions and evaluates models against real-world Chinese examination standards, covering image-text comprehension, complex reasoning, and knowledge recall. In addition, we introduce **MULTI-Elite**, a curated subset of 500 hard questions, and **MULTI-Extend**, a collection of more than 4,500 external knowledge context pieces for testing in-context learning capabilities. **MULTI** serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.

## ⏬ Download

11 changes: 6 additions & 5 deletions docs/index.html
@@ -194,7 +194,7 @@ <h1 class="title is-1 publication-title">

<!-- HuggingFace Link -->
<span class="link-block">
<a href="https://github.com/OpenDFM/MULTI-Benchmark/tree/main?tab=readme-ov-file#-leaderboard" target="_blank"
<a href="#leaderboard"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<p style="font-size:20px">&#x1F3C6;</p>
@@ -236,9 +236,10 @@ <h3 class="subtitle is-size-3-tablet has-text-left pb-">
overly simplistic tasks, some models have already surpassed human expert baselines. In this paper, we present <b>MULTI</b>, a Chinese multimodal dataset derived from authentic examination
questions. Comprising over 18,000 carefully selected and refined questions, <b>MULTI</b> evaluates models using real-world examination standards, encompassing image-text comprehension,
complex reasoning, and knowledge recall. Additionally, we also introduce <b>MULTI-Elite</b>, a 500-question selected hard subset, and <b>MULTI-Extend</b> with more than 4,500 external knowledge
context pieces for testing in-context learning capabilities. Our evaluation highlights substantial room for MLLM advancement, with Qwen2-VL-72B achieving a <b>76.9%</b> accuracy on <b>MULTI</b> and
<b>53.1%</b> on <b>MULTI-Elite</b> leading 25 evaluated models, compared to human expert baselines of <b>86.1%</b> and <b>73.1%</b>. <b>MULTI</b> serves not only as a robust evaluation platform but also paves the way
for the development of expert-level AI.
context pieces for testing in-context learning capabilities.
<!-- Our evaluation highlights substantial room for MLLM advancement, with Qwen2-VL-72B achieving a <b>76.9%</b> accuracy on <b>MULTI</b> and-->
<!-- <b>53.1%</b> on <b>MULTI-Elite</b> leading 25 evaluated models, compared to human expert baselines of <b>86.1%</b> and <b>73.1%</b>. -->
<b>MULTI</b> serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.
</p>
</h3>
</div>
@@ -419,7 +420,7 @@ <h2 class="title is-3" id="leaderboard">Leaderboard</h2>
<!-- Table body will be populated dynamically -->
</tbody>
</table>
<p class="test-desc">Results of different models on the MULTI and MULTI-Elite.
<p class="test-desc"><h4>Results of different models on the MULTI and MULTI-Elite.</h4>
<br>T: Pure-text LLM, One: Only one image in the first turn, SI: Single image in each turn, MI: Multiple images in one turn. The <u>underline</u> means the model must have an
image as input.
<br>JuH: Level of Junior High School, SeH: Level of Senior High School, Uni: Level of University, Driv: Chinese Driving Test, AAT: Chinese Administrative Aptitude Test.
4 changes: 2 additions & 2 deletions docs/static/js/index.js
@@ -41,10 +41,10 @@ function loadTableData() {
// tr.classList.add(row.info.type);

const nameCell = row.Url && row.Url.trim() !== '' ?
`<a href="${row.Url}"><b>${row.Model}</b>🔗</a>` :
`<a href="${row.Url}" target="_blank"><b>${row.Model}</b>🔗</a>` :
`<b>${row.Model}</b>`;
const versionCell = row.VersionUrl && row.VersionUrl.trim() !== '' ?
`<a href="${row.VersionUrl}"><b>${row.Version}</b>🤗</a>` :
`<a href="${row.VersionUrl}" target="_blank"><b>${row.Version}</b>🤗</a>` :
`<b>${row.Version}</b>`;

const safeGet = (obj, path, defaultValue = '-') => {
