
Measuring massive multitask language understanding in Chinese


简体中文 | English

📄 Paper • 🏆 Leaderboard • 🤗 Dataset

Introduction

CMMLU is a comprehensive benchmark designed to evaluate the knowledge and reasoning abilities of LLMs in the context of Chinese language and culture. It covers 67 subjects spanning elementary to advanced professional levels, including disciplines that require computational expertise, such as physics and mathematics, as well as the humanities and social sciences. Many of these tasks are not easily translatable from other languages because of their context-specific nuances and wording, and numerous tasks have answers that are specific to China and may not be universally applicable or considered correct in other regions or languages.

Note: for evaluation on Ancient Chinese, please refer to ACLUE.

Leaderboard

The following table displays the performance of models in the five-shot and zero-shot settings.

Five-shot
| Model | STEM | Humanities | Social Science | Other | China-specific | Average |
|---|---|---|---|---|---|---|
| **Open Access Models** |  |  |  |  |  |  |
| Lingzhi-72B-chat | 84.82 | 92.93 | 91.25 | 92.64 | 90.89 | 90.26 |
| Spark 4.0-2024-10-14 | 84.75 | 93.53 | 90.64 | 91.03 | 90.09 | 90.07 |
| Qwen2-72B | 82.80 | 93.84 | 90.38 | 92.71 | 90.60 | 89.65 |
| Jiutian-大模型 | 80.58 | 93.33 | 89.81 | 91.79 | 89.80 | 88.59 |
| Qwen1.5-110B | 81.59 | 92.41 | 89.14 | 91.19 | 89.02 | 88.32 |
| JIUTIAN-57B | 79.79 | 91.99 | 88.57 | 90.27 | 88.02 | 87.39 |
| Qwen2.5-72B | 80.35 | 88.41 | 85.96 | 86.06 | 88.91 | 85.67 |
| Qwen1.5-72B | 76.83 | 88.37 | 84.15 | 86.06 | 83.77 | 83.54 |
| PCI-TransGPT | 76.85 | 86.46 | 81.65 | 84.57 | 82.85 | 82.46 |
| Qwen1.5-32B | 76.25 | 86.31 | 83.42 | 83.82 | 82.84 | 82.25 |
| BlueLM-7B | 61.36 | 79.83 | 77.80 | 78.89 | 76.74 | 74.27 |
| Qwen1.5-7B | 63.64 | 76.42 | 74.69 | 75.91 | 73.43 | 72.50 |
| XuanYuan-70B | 60.74 | 77.79 | 75.47 | 70.81 | 70.92 | 71.10 |
| GPT4 | 65.23 | 72.11 | 72.06 | 74.79 | 66.12 | 70.95 |
| Llama-3.1-70B-Instruct | 55.05 | 66.62 | 66.08 | 70.50 | 61.65 | 64.38 |
| XuanYuan-13B | 50.07 | 66.32 | 64.11 | 59.99 | 60.55 | 60.05 |
| Qwen-7B | 48.39 | 63.77 | 61.22 | 62.14 | 58.73 | 58.66 |
| ZhiLu-13B | 44.26 | 61.54 | 60.25 | 61.14 | 57.14 | 57.16 |
| ChatGPT | 47.81 | 55.68 | 56.50 | 62.66 | 50.69 | 55.51 |
| Baichuan-13B | 42.38 | 61.61 | 60.44 | 59.26 | 56.62 | 55.82 |
| ChatGLM2-6B | 42.55 | 50.98 | 50.99 | 50.80 | 48.37 | 48.80 |
| Baichuan-7B | 35.25 | 48.07 | 47.88 | 46.61 | 44.14 | 44.43 |
| Falcon-40B | 33.33 | 43.46 | 44.28 | 44.75 | 39.46 | 41.45 |
| LLaMA-65B | 34.47 | 40.24 | 41.55 | 42.88 | 37.00 | 39.80 |
| ChatGLM-6B | 32.35 | 39.22 | 39.65 | 38.62 | 37.70 | 37.48 |
| BatGPT-15B | 34.96 | 35.45 | 36.31 | 42.14 | 37.89 | 37.16 |
| BLOOMZ-7B | 30.56 | 39.10 | 38.59 | 40.32 | 37.15 | 37.04 |
| Llama-3-70B-Instruct | 30.10 | 39.38 | 32.93 | 48.05 | 37.17 | 36.85 |
| Chinese-LLaMA-13B | 27.12 | 33.18 | 34.87 | 35.10 | 32.97 | 32.63 |
| Bactrian-LLaMA-13B | 27.52 | 32.47 | 32.27 | 35.77 | 31.56 | 31.88 |
| MOSS-SFT-16B | 27.23 | 30.41 | 28.84 | 32.56 | 28.68 | 29.57 |
| **Models with Limited Access** |  |  |  |  |  |  |
| BlueLM | 78.16 | 90.50 | 86.88 | 87.87 | 87.55 | 85.59 |
| Mind GPT | 76.76 | 87.09 | 83.74 | 84.70 | 81.82 | 82.84 |
| ZW-LM | 72.68 | 85.84 | 83.61 | 85.68 | 82.71 | 81.73 |
| QuarkLLM | 70.97 | 85.20 | 82.88 | 82.71 | 81.12 | 80.27 |
| Galaxy | 69.61 | 74.95 | 78.54 | 77.93 | 73.99 | 74.03 |
| Mengzi-7B | 49.59 | 75.27 | 71.36 | 70.52 | 69.23 | 66.41 |
| KwaiYii-13B | 46.54 | 69.22 | 64.49 | 65.09 | 63.10 | 61.73 |
| MiLM-6B | 46.85 | 61.12 | 61.68 | 58.84 | 59.39 | 57.17 |
| MiLM-1.3B | 35.59 | 49.58 | 49.03 | 47.56 | 48.17 | 45.39 |
| Random | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 |
Zero-shot
| Model | STEM | Humanities | Social Science | Other | China-specific | Average |
|---|---|---|---|---|---|---|
| **Open Access Models** |  |  |  |  |  |  |
| Spark 4.0-2024-10-14 | 87.36 | 93.97 | 90.03 | 92.71 | 90.40 | 90.97 |
| Lingzhi-72B-chat | 84.85 | 92.99 | 90.75 | 92.47 | 90.68 | 90.07 |
| Qwen1.5-110B | 80.84 | 91.51 | 89.01 | 89.99 | 88.64 | 87.64 |
| Qwen2-72B | 80.92 | 90.90 | 87.93 | 91.23 | 87.24 | 87.47 |
| Qwen2.5-72B | 80.67 | 87.00 | 84.66 | 87.35 | 83.21 | 84.70 |
| PCI-TransGPT | 76.69 | 86.26 | 81.71 | 84.47 | 83.13 | 82.44 |
| Qwen1.5-72B | 75.07 | 86.15 | 83.06 | 83.84 | 82.78 | 81.81 |
| Qwen1.5-32B | 74.82 | 85.13 | 82.49 | 84.34 | 82.47 | 81.47 |
| BlueLM-7B | 62.08 | 81.29 | 79.38 | 79.56 | 77.69 | 75.40 |
| Qwen1.5-7B | 62.87 | 74.90 | 72.65 | 74.64 | 71.94 | 71.05 |
| XuanYuan-70B | 61.21 | 76.25 | 74.44 | 70.67 | 69.35 | 70.59 |
| Llama-3.1-70B-Instruct | 61.60 | 71.44 | 69.42 | 74.72 | 63.79 | 69.01 |
| GPT4 | 63.16 | 69.19 | 70.26 | 73.16 | 63.47 | 68.90 |
| Llama-3-70B-Instruct | 57.02 | 67.87 | 68.67 | 73.95 | 62.96 | 66.74 |
| XuanYuan-13B | 50.22 | 67.55 | 63.85 | 61.17 | 61.50 | 60.51 |
| Qwen-7B | 46.33 | 62.54 | 60.48 | 61.72 | 58.77 | 57.57 |
| ZhiLu-13B | 43.53 | 61.60 | 61.40 | 60.15 | 58.97 | 57.14 |
| ChatGPT | 44.80 | 53.61 | 54.22 | 59.95 | 49.74 | 53.22 |
| Baichuan-13B | 42.04 | 60.49 | 59.55 | 56.60 | 55.72 | 54.63 |
| ChatGLM2-6B | 41.28 | 52.85 | 53.37 | 52.24 | 50.58 | 49.95 |
| BLOOMZ-7B | 33.03 | 45.74 | 45.74 | 46.25 | 41.58 | 42.80 |
| Baichuan-7B | 32.79 | 44.43 | 46.78 | 44.79 | 43.11 | 42.33 |
| ChatGLM-6B | 32.22 | 42.91 | 44.81 | 42.60 | 41.93 | 40.79 |
| BatGPT-15B | 33.72 | 36.53 | 38.07 | 46.94 | 38.32 | 38.51 |
| Falcon-40B | 31.11 | 41.30 | 40.87 | 40.61 | 36.05 | 38.50 |
| LLaMA-65B | 31.09 | 34.45 | 36.05 | 37.94 | 32.89 | 34.88 |
| Bactrian-LLaMA-13B | 26.46 | 29.36 | 31.81 | 31.55 | 29.17 | 30.06 |
| Chinese-LLaMA-13B | 26.76 | 26.57 | 27.42 | 28.33 | 26.73 | 27.34 |
| MOSS-SFT-16B | 25.68 | 26.35 | 27.21 | 27.92 | 26.70 | 26.88 |
| **Models with Limited Access** |  |  |  |  |  |  |
| BlueLM | 76.36 | 90.34 | 86.23 | 86.94 | 86.84 | 84.68 |
| DiMind | 70.92 | 86.66 | 86.04 | 86.60 | 81.49 | 82.73 |
| 云天天书 | 73.03 | 83.78 | 82.30 | 84.04 | 81.37 | 80.62 |
| Mind GPT | 71.20 | 83.95 | 80.59 | 82.11 | 78.90 | 79.20 |
| QuarkLLM | 67.23 | 81.69 | 79.47 | 80.74 | 77.00 | 77.08 |
| Galaxy | 69.38 | 75.33 | 78.27 | 78.19 | 73.25 | 73.85 |
| ZW-LM | 63.93 | 77.95 | 76.28 | 72.99 | 72.94 | 72.74 |
| KwaiYii-66B | 55.20 | 77.10 | 71.74 | 73.30 | 71.27 | 69.96 |
| Mengzi-7B | 49.49 | 75.84 | 72.32 | 70.87 | 70.00 | 66.88 |
| KwaiYii-13B | 46.82 | 69.35 | 63.42 | 64.02 | 63.26 | 61.22 |
| MiLM-6B | 48.88 | 63.49 | 66.20 | 62.14 | 62.07 | 60.37 |
| MiLM-1.3B | 40.51 | 54.82 | 54.15 | 53.99 | 52.26 | 50.79 |
| Random | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 |

How to submit

  • For open-source or API-accessible models, open a pull request to update the results (you can also provide test code in the src folder).
  • For models that are neither open-source nor API-accessible, add the results to the corresponding section and open a pull request.

Data

We provide our dataset, organized by subject, in the data folder. You can also access it via Hugging Face.

Quick Use

Our dataset has been added to lm-evaluation-harness and OpenCompass, so you can evaluate your model with either of these open-source tools.

Data Format

Each question in the dataset is a multiple-choice question with four choices, of which exactly one is correct. The data is stored in comma-separated .csv files. Here is an example:

    同一物种的两类细胞各产生一种分泌蛋白,组成这两种蛋白质的各种氨基酸含量相同,但排列顺序不同。其原因是参与这两种蛋白质合成的,tRNA种类不同,同一密码子所决定的氨基酸不同,mRNA碱基序列不同,核糖体成分不同,C
    Translation:"Two types of cells within the same species each produce a secretion protein. The various amino acids that make up these two proteins have the same composition but differ in their arrangement. The reason for this difference in arrangement in the synthesis of these two proteins is,Different types of tRNA,Different amino acids determined by the same codon,Different mRNA base sequences,Different ribosome components,C"
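The row layout above (question, four options, answer letter) can be read with Python's standard csv module. A minimal sketch, assuming that layout; the sample row and the field names here are illustrative, not taken from the dataset files:

```python
import csv
import io

# Illustrative row in the same layout as the example above:
# question, option A, option B, option C, option D, answer letter.
SAMPLE = "Which organelle carries out photosynthesis?,Mitochondrion,Chloroplast,Ribosome,Nucleus,B\n"

def parse_rows(text):
    """Yield one dict per CSV row: question text, choices keyed A-D, gold answer letter."""
    for row in csv.reader(io.StringIO(text)):
        question, *choices, answer = row
        yield {
            "question": question,
            "choices": dict(zip("ABCD", choices)),
            "answer": answer,
        }

rows = list(parse_rows(SAMPLE))
```

Here `rows[0]["choices"]["B"]` is `"Chloroplast"` and `rows[0]["answer"]` is `"B"`.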

Prompt

We provide the preprocessing code in the src/mp_utils directory. It includes the approach we used to generate the direct-answer prompt and the chain-of-thought (CoT) prompt.

Here is an example of data after adding direct answer prompt:

    以下是关于(高中生物)的单项选择题,请直接给出正确答案的选项。
    (Here are some single-choice questions about(high school biology), please provide the correct answer choice directly.)
    题目:同一物种的两类细胞各产生一种分泌蛋白,组成这两种蛋白质的各种氨基酸含量相同,但排列顺序不同。其原因是参与这两种蛋白质合成的:
    (Two types of cells within the same species each produce a secretion protein. The various amino acids that make up these two proteins have the same composition but differ in their arrangement. The reason for this difference in arrangement in the synthesis of these two proteins is)
    A. tRNA种类不同(Different types of tRNA)
    B. 同一密码子所决定的氨基酸不同(Different amino acids determined by the same codon)
    C. mRNA碱基序列不同(Different mRNA base sequences)
    D. 核糖体成分不同(Different ribosome components)
    答案是:C(Answer: C)

    ... [other examples] 

    题目:某种植物病毒V是通过稻飞虱吸食水稻汁液在水稻间传播的。稻田中青蛙数量的增加可减少该病毒在水稻间的传播。下列叙述正确的是:
    (Question: A certain plant virus, V, is transmitted between rice plants through the feeding of rice planthoppers. An increase in the number of frogs in the rice field can reduce the spread of this virus among the rice plants. The correct statement among the options provided would be)
    A. 青蛙与稻飞虱是捕食关系(Frogs and rice planthoppers have a predatory relationship)
    B. 水稻和病毒V是互利共生关系(Rice plants and virus V have a mutualistic symbiotic relationship)
    C. 病毒V与青蛙是寄生关系(Virus V and frogs have a parasitic relationship)
    D. 水稻与青蛙是竞争关系(Rice plants and frogs have a competitive relationship)
    答案是: (Answer:)

For the CoT prompt, we changed the instruction from "请直接给出正确答案的选项 (please provide the correct answer choice directly)" to "逐步分析并选出正确答案 (analyze step by step and select the correct answer)".
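The prompt format above can be assembled programmatically. A sketch under the stated format, mirroring the wording in the example; the function and parameter names are hypothetical and not the actual src/mp_utils code:

```python
def build_prompt(subject, examples, target, cot=False):
    """Assemble a few-shot multiple-choice prompt in the style shown above.

    examples: list of (question, {letter: choice_text}, answer_letter) demonstrations.
    target:   (question, {letter: choice_text}) to be answered by the model.
    cot:      switch the instruction to the chain-of-thought variant.
    """
    instruction = "逐步分析并选出正确答案" if cot else "请直接给出正确答案的选项"
    parts = [f"以下是关于({subject})的单项选择题,{instruction}。"]
    for question, choices, answer in examples:
        parts.append(f"题目:{question}")
        parts.extend(f"{letter}. {text}" for letter, text in choices.items())
        parts.append(f"答案是:{answer}")
    question, choices = target
    parts.append(f"题目:{question}")
    parts.extend(f"{letter}. {text}" for letter, text in choices.items())
    parts.append("答案是:")  # left open for the model to complete
    return "\n".join(parts)
```

With five demonstrations in `examples` this yields a five-shot prompt; with an empty list it yields the zero-shot variant.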

Evaluation

The code used to evaluate each model is in the src directory, and the scripts to run it are listed in the script directory.
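Scoring reduces to plain accuracy over the answer letters. A minimal sketch, assuming answers are extracted by taking the first A/B/C/D character in a response (a common heuristic, not necessarily the exact extraction logic in src):

```python
def first_choice_letter(text):
    """Return the first A/B/C/D character in a model response, or None if absent."""
    return next((ch for ch in text if ch in "ABCD"), None)

def accuracy(responses, golds):
    """Fraction of responses whose extracted letter matches the gold answer."""
    hits = sum(first_choice_letter(r) == g for r, g in zip(responses, golds))
    return hits / len(golds)
```

For example, `accuracy(["C", "A", "答案是:B"], ["C", "B", "B"])` scores the first and third responses as correct, giving 2/3.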

Citation

    @misc{li2023cmmlu,
          title={CMMLU: Measuring massive multitask language understanding in Chinese},
          author={Haonan Li and Yixuan Zhang and Fajri Koto and Yifei Yang and Hai Zhao and Yeyun Gong and Nan Duan and Timothy Baldwin},
          year={2023},
          eprint={2306.09212},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }

License

The CMMLU dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.