📄 Paper • 🏆 Leaderboard • 🤗 Dataset
CMMLU is a comprehensive evaluation benchmark specifically designed to assess the knowledge and reasoning abilities of LLMs within the context of Chinese language and culture. CMMLU covers a wide range of subjects, comprising 67 topics that span from elementary to advanced professional levels. It includes subjects that require computational expertise, such as physics and mathematics, as well as disciplines within the humanities and social sciences. Many of these tasks are not easily translatable from other languages due to their specific contextual nuances and wording. Furthermore, numerous tasks within CMMLU have answers that are specific to China and may not be universally applicable or considered correct in other regions or languages.
Note: if you need an Ancient Chinese evaluation, please refer to ACLUE.
The following tables display the performance of models in the five-shot and zero-shot settings.
Five-shot
Model | STEM | Humanities | Social Science | Other | China-specific | Average |
---|---|---|---|---|---|---|
Open Access Models | ||||||
Lingzhi-72B-chat | 84.82 | 92.93 | 91.25 | 92.64 | 90.89 | 90.26 |
Spark 4.0-2024-10-14 | 84.75 | 93.53 | 90.64 | 91.03 | 90.09 | 90.07 |
Qwen2-72B | 82.80 | 93.84 | 90.38 | 92.71 | 90.60 | 89.65 |
Jiutian-大模型 | 80.58 | 93.33 | 89.81 | 91.79 | 89.80 | 88.59 |
Qwen1.5-110B | 81.59 | 92.41 | 89.14 | 91.19 | 89.02 | 88.32 |
JIUTIAN-57B | 79.79 | 91.99 | 88.57 | 90.27 | 88.02 | 87.39 |
Qwen2.5-72B | 80.35 | 88.41 | 85.96 | 86.06 | 88.91 | 85.67 |
Qwen1.5-72B | 76.83 | 88.37 | 84.15 | 86.06 | 83.77 | 83.54 |
PCI-TransGPT | 76.85 | 86.46 | 81.65 | 84.57 | 82.85 | 82.46 |
Qwen1.5-32B | 76.25 | 86.31 | 83.42 | 83.82 | 82.84 | 82.25 |
BlueLM-7B | 61.36 | 79.83 | 77.80 | 78.89 | 76.74 | 74.27 |
Qwen1.5-7B | 63.64 | 76.42 | 74.69 | 75.91 | 73.43 | 72.50 |
XuanYuan-70B | 60.74 | 77.79 | 75.47 | 70.81 | 70.92 | 71.10 |
GPT4 | 65.23 | 72.11 | 72.06 | 74.79 | 66.12 | 70.95 |
Llama-3.1-70B-Instruct | 55.05 | 66.62 | 66.08 | 70.50 | 61.65 | 64.38 |
XuanYuan-13B | 50.07 | 66.32 | 64.11 | 59.99 | 60.55 | 60.05 |
Qwen-7B | 48.39 | 63.77 | 61.22 | 62.14 | 58.73 | 58.66 |
ZhiLu-13B | 44.26 | 61.54 | 60.25 | 61.14 | 57.14 | 57.16 |
Baichuan-13B | 42.38 | 61.61 | 60.44 | 59.26 | 56.62 | 55.82 |
ChatGPT | 47.81 | 55.68 | 56.50 | 62.66 | 50.69 | 55.51 |
ChatGLM2-6B | 42.55 | 50.98 | 50.99 | 50.80 | 48.37 | 48.80 |
Baichuan-7B | 35.25 | 48.07 | 47.88 | 46.61 | 44.14 | 44.43 |
Falcon-40B | 33.33 | 43.46 | 44.28 | 44.75 | 39.46 | 41.45 |
LLaMA-65B | 34.47 | 40.24 | 41.55 | 42.88 | 37.00 | 39.80 |
ChatGLM-6B | 32.35 | 39.22 | 39.65 | 38.62 | 37.70 | 37.48 |
BatGPT-15B | 34.96 | 35.45 | 36.31 | 42.14 | 37.89 | 37.16 |
BLOOMZ-7B | 30.56 | 39.10 | 38.59 | 40.32 | 37.15 | 37.04 |
Llama-3-70B-Instruct | 30.10 | 39.38 | 32.93 | 48.05 | 37.17 | 36.85 |
Chinese-LLaMA-13B | 27.12 | 33.18 | 34.87 | 35.10 | 32.97 | 32.63 |
Bactrian-LLaMA-13B | 27.52 | 32.47 | 32.27 | 35.77 | 31.56 | 31.88 |
MOSS-SFT-16B | 27.23 | 30.41 | 28.84 | 32.56 | 28.68 | 29.57 |
Models with Limited Access | ||||||
BlueLM | 78.16 | 90.50 | 86.88 | 87.87 | 87.55 | 85.59 |
Mind GPT | 76.76 | 87.09 | 83.74 | 84.70 | 81.82 | 82.84 |
ZW-LM | 72.68 | 85.84 | 83.61 | 85.68 | 82.71 | 81.73 |
QuarkLLM | 70.97 | 85.20 | 82.88 | 82.71 | 81.12 | 80.27 |
Galaxy | 69.61 | 74.95 | 78.54 | 77.93 | 73.99 | 74.03 |
Mengzi-7B | 49.59 | 75.27 | 71.36 | 70.52 | 69.23 | 66.41 |
KwaiYii-13B | 46.54 | 69.22 | 64.49 | 65.09 | 63.10 | 61.73 |
MiLM-6B | 46.85 | 61.12 | 61.68 | 58.84 | 59.39 | 57.17 |
MiLM-1.3B | 35.59 | 49.58 | 49.03 | 47.56 | 48.17 | 45.39 |
Random | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 |
Zero-shot
Model | STEM | Humanities | Social Science | Other | China-specific | Average |
---|---|---|---|---|---|---|
Open Access Models | ||||||
Spark 4.0-2024-10-14 | 87.36 | 93.97 | 90.03 | 92.71 | 90.40 | 90.97 |
Lingzhi-72B-chat | 84.85 | 92.99 | 90.75 | 92.47 | 90.68 | 90.07 |
Qwen1.5-110B | 80.84 | 91.51 | 89.01 | 89.99 | 88.64 | 87.64 |
Qwen2-72B | 80.92 | 90.90 | 87.93 | 91.23 | 87.24 | 87.47 |
Qwen2.5-72B | 80.67 | 87.00 | 84.66 | 87.35 | 83.21 | 84.70 |
PCI-TransGPT | 76.69 | 86.26 | 81.71 | 84.47 | 83.13 | 82.44 |
Qwen1.5-72B | 75.07 | 86.15 | 83.06 | 83.84 | 82.78 | 81.81 |
Qwen1.5-32B | 74.82 | 85.13 | 82.49 | 84.34 | 82.47 | 81.47 |
BlueLM-7B | 62.08 | 81.29 | 79.38 | 79.56 | 77.69 | 75.40 |
Qwen1.5-7B | 62.87 | 74.90 | 72.65 | 74.64 | 71.94 | 71.05 |
XuanYuan-70B | 61.21 | 76.25 | 74.44 | 70.67 | 69.35 | 70.59 |
Llama-3.1-70B-Instruct | 61.60 | 71.44 | 69.42 | 74.72 | 63.79 | 69.01 |
GPT4 | 63.16 | 69.19 | 70.26 | 73.16 | 63.47 | 68.90 |
Llama-3-70B-Instruct | 57.02 | 67.87 | 68.67 | 73.95 | 62.96 | 66.74 |
XuanYuan-13B | 50.22 | 67.55 | 63.85 | 61.17 | 61.50 | 60.51 |
Qwen-7B | 46.33 | 62.54 | 60.48 | 61.72 | 58.77 | 57.57 |
ZhiLu-13B | 43.53 | 61.60 | 61.40 | 60.15 | 58.97 | 57.14 |
ChatGPT | 44.80 | 53.61 | 54.22 | 59.95 | 49.74 | 53.22 |
Baichuan-13B | 42.04 | 60.49 | 59.55 | 56.60 | 55.72 | 54.63 |
ChatGLM2-6B | 41.28 | 52.85 | 53.37 | 52.24 | 50.58 | 49.95 |
BLOOMZ-7B | 33.03 | 45.74 | 45.74 | 46.25 | 41.58 | 42.80 |
Baichuan-7B | 32.79 | 44.43 | 46.78 | 44.79 | 43.11 | 42.33 |
ChatGLM-6B | 32.22 | 42.91 | 44.81 | 42.60 | 41.93 | 40.79 |
BatGPT-15B | 33.72 | 36.53 | 38.07 | 46.94 | 38.32 | 38.51 |
Falcon-40B | 31.11 | 41.30 | 40.87 | 40.61 | 36.05 | 38.50 |
LLaMA-65B | 31.09 | 34.45 | 36.05 | 37.94 | 32.89 | 34.88 |
Bactrian-LLaMA-13B | 26.46 | 29.36 | 31.81 | 31.55 | 29.17 | 30.06 |
Chinese-LLaMA-13B | 26.76 | 26.57 | 27.42 | 28.33 | 26.73 | 27.34 |
MOSS-SFT-16B | 25.68 | 26.35 | 27.21 | 27.92 | 26.70 | 26.88 |
Models with Limited Access | ||||||
BlueLM | 76.36 | 90.34 | 86.23 | 86.94 | 86.84 | 84.68 |
DiMind | 70.92 | 86.66 | 86.04 | 86.60 | 81.49 | 82.73 |
云天天书 | 73.03 | 83.78 | 82.30 | 84.04 | 81.37 | 80.62 |
Mind GPT | 71.20 | 83.95 | 80.59 | 82.11 | 78.90 | 79.20 |
QuarkLLM | 67.23 | 81.69 | 79.47 | 80.74 | 77.00 | 77.08 |
Galaxy | 69.38 | 75.33 | 78.27 | 78.19 | 73.25 | 73.85 |
ZW-LM | 63.93 | 77.95 | 76.28 | 72.99 | 72.94 | 72.74 |
KwaiYii-66B | 55.20 | 77.10 | 71.74 | 73.30 | 71.27 | 69.96 |
Mengzi-7B | 49.49 | 75.84 | 72.32 | 70.87 | 70.00 | 66.88 |
KwaiYii-13B | 46.82 | 69.35 | 63.42 | 64.02 | 63.26 | 61.22 |
MiLM-6B | 48.88 | 63.49 | 66.20 | 62.14 | 62.07 | 60.37 |
MiLM-1.3B | 40.51 | 54.82 | 54.15 | 53.99 | 52.26 | 50.79 |
Random | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 |
- For open-source/API models, open a pull request to update the results (you can also provide test code in the `src` folder).
- For models that are neither open-source nor accessible via API, update the results in the corresponding part and open a pull request.
We provide our dataset organized by subject in the `data` folder. You can also access our dataset via Hugging Face.
Our dataset has been added to lm-evaluation-harness and OpenCompass, so you can evaluate your model with these open-source tools.
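For example, here is a minimal sketch of loading a single subject with the Hugging Face `datasets` library. The repository id, subject name, and field names below are assumptions; check the dataset card for the exact identifiers.

```python
# Minimal loading sketch (repository id, subject name, and field names are assumed).
from datasets import load_dataset

subject = "high_school_biology"  # hypothetical subject identifier
cmmlu = load_dataset("haonan-li/cmmlu", subject)

# Each example is expected to hold a question, four options (A-D), and the answer key.
sample = cmmlu["test"][0]
print(sample["Question"], sample["A"], sample["B"], sample["C"], sample["D"], sample["Answer"])
```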
Each question in the dataset is a multiple-choice question with 4 choices and only one correct answer. The data is stored in comma-separated .csv files. Here is an example:
同一物种的两类细胞各产生一种分泌蛋白,组成这两种蛋白质的各种氨基酸含量相同,但排列顺序不同。其原因是参与这两种蛋白质合成的,tRNA种类不同,同一密码子所决定的氨基酸不同,mRNA碱基序列不同,核糖体成分不同,C
Translation:"Two types of cells within the same species each produce a secretion protein. The various amino acids that make up these two proteins have the same composition but differ in their arrangement. The reason for this difference in arrangement in the synthesis of these two proteins is,Different types of tRNA,Different amino acids determined by the same codon,Different mRNA base sequences,Different ribosome components,C"
We provide the preprocessing code in the src/mp_utils directory. It includes the approach we used to generate the direct-answer prompt and the chain-of-thought (CoT) prompt.
Here is an example of data after adding direct answer prompt:
以下是关于(高中生物)的单项选择题,请直接给出正确答案的选项。
(Here are some single-choice questions about (high school biology), please provide the correct answer choice directly.)
题目:同一物种的两类细胞各产生一种分泌蛋白,组成这两种蛋白质的各种氨基酸含量相同,但排列顺序不同。其原因是参与这两种蛋白质合成的:
(Two types of cells within the same species each produce a secretion protein. The various amino acids that make up these two proteins have the same composition but differ in their arrangement. The reason for this difference in arrangement in the synthesis of these two proteins is)
A. tRNA种类不同(Different types of tRNA)
B. 同一密码子所决定的氨基酸不同(Different amino acids determined by the same codon)
C. mRNA碱基序列不同(Different mRNA base sequences)
D. 核糖体成分不同(Different ribosome components)
答案是:C(Answer: C)
... [other examples]
题目:某种植物病毒V是通过稻飞虱吸食水稻汁液在水稻间传播的。稻田中青蛙数量的增加可减少该病毒在水稻间的传播。下列叙述正确的是:
(Question: A certain plant virus, V, is transmitted between rice plants through the feeding of rice planthoppers. An increase in the number of frogs in the rice field can reduce the spread of this virus among the rice plants. The correct statement among the options provided would be)
A. 青蛙与稻飞虱是捕食关系(Frogs and rice planthoppers have a predatory relationship)
B. 水稻和病毒V是互利共生关系(Rice plants and virus V have a mutualistic symbiotic relationship)
C. 病毒V与青蛙是寄生关系(Virus V and frogs have a parasitic relationship)
D. 水稻与青蛙是竞争关系(Rice plants and frogs have a competitive relationship)
答案是: (Answer:)
For the CoT prompt, we modified the instruction from “请直接给出正确答案的选项” (please provide the correct answer choice directly) to “逐步分析并选出正确答案” (analyze step by step and select the correct answer).
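The following is an illustrative sketch of how such a prompt could be assembled; the function name and argument structure are hypothetical and not the actual `mp_utils` API.

```python
# Hypothetical sketch of assembling a direct-answer or CoT prompt in the style shown above;
# this is not the actual code in src/mp_utils.
def build_prompt(subject_zh: str, question: str, options: list[str], cot: bool = False) -> str:
    instruction = "逐步分析并选出正确答案" if cot else "请直接给出正确答案的选项"
    prompt = f"以下是关于({subject_zh})的单项选择题,{instruction}。\n\n"
    prompt += f"题目:{question}\n"
    for label, option in zip("ABCD", options):
        prompt += f"{label}. {option}\n"
    prompt += "答案是:"
    return prompt

# Few-shot usage: prepend several completed examples (same format, each ending
# with "答案是:X") before the test question built with the same function.
```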
The code we used to evaluate each model is in the `src` directory, and the scripts to run the evaluations are in the `script` directory.
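As a rough illustration of the scoring step (not the repository's actual evaluation code), one simple approach is to take the first A/B/C/D that appears in the model's continuation as its prediction and compare it with the gold answer:

```python
import re

# Illustrative scoring sketch, not the actual code in src.
def extract_choice(output: str) -> str | None:
    """Return the first A/B/C/D found in the model output, or None."""
    match = re.search(r"[ABCD]", output)
    return match.group(0) if match else None

def accuracy(outputs: list[str], answers: list[str]) -> float:
    """Fraction of examples where the extracted choice matches the gold answer."""
    correct = sum(extract_choice(o) == a for o, a in zip(outputs, answers))
    return correct / len(answers) if answers else 0.0
```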
@misc{li2023cmmlu,
title={CMMLU: Measuring massive multitask language understanding in Chinese},
author={Haonan Li and Yixuan Zhang and Fajri Koto and Yifei Yang and Hai Zhao and Yeyun Gong and Nan Duan and Timothy Baldwin},
year={2023},
eprint={2306.09212},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
The CMMLU dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.