From d1029db9da04bc71691cc4ab328821b267c5f386 Mon Sep 17 00:00:00 2001
From: Zicheng Zhang <58689334+zzc-1998@users.noreply.github.com>
Date: Thu, 18 Jul 2024 13:59:23 +0800
Subject: [PATCH] Update README.md

---
 README.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 7b6a867..6fd8f94 100644
--- a/README.md
+++ b/README.md
@@ -140,15 +140,29 @@ print(ds["dev"][0])
 
 We test on three closed-source API models: GPT-4V-Turbo (`gpt-4-vision-preview`, replacing the no-longer-available *old version* GPT-4V results), Gemini Pro (`gemini-pro-vision`), and Qwen-VL-Plus (`qwen-vl-plus`). Slightly improved over the older version, GPT-4V still tops all MLLMs and nearly matches a junior-level human's performance. Gemini Pro and Qwen-VL-Plus follow behind, still better than the best open-source MLLM (0.65 overall).
 
+Update on [2024/7/18]: we are glad to release the new SOTA performance of **BlueImage-GPT** (closed-source).
+
+**Perception, A1-Single**
 |**Participant Name** | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
 | - | - | - | - | - | - | - | - | - |
 | Qwen-VL-Plus (`qwen-vl-plus`) | 0.7574 | 0.7325 | 0.5733 | 0.6488 | 0.7324 | 0.6867 | 0.7056 | 0.6893 |
+| BlueImage-GPT (from VIVO *New Champion*) | **0.8467** | 0.8351 | **0.7469** | 0.7819 | **0.8594** | 0.7995 | 0.8240 | 0.8107 |
 | Gemini-Pro (`gemini-pro-vision`) | 0.7221 | 0.7300 | 0.6645 | 0.6530 | 0.7291 | 0.7082 | 0.7665 | 0.7058 |
 | GPT-4V-Turbo (`gpt-4-vision-preview`) | 0.7722 | 0.7839 | 0.6645 | 0.7101 | 0.7107 | 0.7936 | 0.7891 | 0.7410 |
 | GPT-4V (*old version*) | 0.7792 | 0.7918 | 0.6268 | 0.7058 | 0.7303 | 0.7466 | 0.7795 | 0.7336 |
 | human-1-junior | 0.8248 | 0.7939 | 0.6029 | 0.7562 | 0.7208 | 0.7637 | 0.7300 | 0.7431 |
-| human-2-senior | **0.8431** | **0.8894** | **0.7202** | **0.7965** | **0.7947** | **0.8390** | **0.8707** | **0.8174** |
+| human-2-senior | 0.8431 | **0.8894** | 0.7202 | **0.7965** | 0.7947 | **0.8390** | **0.8707** | **0.8174** |
+
+**Perception, A2-Pair**
+|**Participant Name** | yes-or-no | what | how | distortion | others | compare | joint | overall |
+| - | - | - | - | - | - | - | - | - |
+| Qwen-VL-Plus (`qwen-vl-plus`) | 0.6685 | 0.5579 | 0.5991 | 0.6246 | 0.5877 | 0.6217 | 0.5920 | 0.6148 |
+| Qwen-VL-Max (`qwen-vl-max`) | 0.6765 | 0.6756 | 0.6535 | 0.6909 | 0.6118 | 0.6865 | 0.6129 | 0.6699 |
+| BlueImage-GPT (from VIVO *New Champion*) | **0.8843** | 0.8033 | 0.7958 | **0.8464** | 0.8062 | 0.8462 | 0.7955 | 0.8348 |
+| Gemini-Pro (`gemini-pro-vision`) | 0.6578 | 0.5661 | 0.5674 | 0.6042 | 0.6055 | 0.6046 | 0.6044 | 0.6046 |
+| GPT-4V (`gpt-4-vision`) | 0.7975 | 0.6949 | 0.8442 | 0.7732 | 0.7993 | 0.8100 | 0.6800 | 0.7807 |
+| Junior-level Human | 0.7811 | 0.7704 | 0.8233 | 0.7817 | 0.7722 | 0.8026 | 0.7639 | 0.8012 |
+| Senior-level Human | 0.8300 | **0.8481** | **0.8985** | 0.8313 | **0.9078** | **0.8655** | **0.8225** | **0.8548** |
 
 We have also evaluated several new open-source models recently, and will release their results soon.
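The tables in this patch follow the convention that a bolded cell (`**x**`) marks the maximum of its column. A small self-contained sketch of that consistency check (the sample rows below are hypothetical data, not the benchmark numbers):

```python
def parse_row(row):
    """Split a Markdown table row into (name, [(value, is_bold), ...])."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    scores = [(float(c.strip("*")), c.startswith("**")) for c in cells[1:]]
    return cells[0], scores

def bolding_consistent(rows):
    """Return True iff exactly the per-column maxima are bolded."""
    parsed = [parse_row(r) for r in rows]
    for j in range(len(parsed[0][1])):
        best = max(scores[j][0] for _, scores in parsed)
        for _, scores in parsed:
            val, bold = scores[j]
            if bold != (val == best):
                return False
    return True

# Hypothetical sample rows, not the README's benchmark data.
sample = [
    "| model-a | **0.84** | 0.80 |",
    "| model-b | 0.79 | **0.90** |",
]
print(bolding_consistent(sample))  # -> True
```

Note that a tied column maximum would require every tied cell to be bolded under this rule; the README tables have no ties, so the sketch leaves that case as-is.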