Comparing models with lm-eval on 10% of the open-source data: are these performance results normal? #281

mathCrazyy · 2025-02-05T12:29:07Z

Model	Notes	agieval_logiqa_zh	arc_zh	cmmlu	mmlu	paws_zh	truthfulqa_zh_mc1	truthfulqa_zh_mc2	xcopa_zh	xnli_zh	xstorycloze_zh	xwinograd_zh
Qwen2.5-7B-Instruct	baseline	0.6818	0.5128	0.797	0.7112	0.525	0.3418	0.5256	0.74	0.3534	0.7105	0.8235
DeepSeek-R1-Distill-Qwen-7B	score (% of baseline)	0.3636 (53.3294%)	0.2991 (58.3268%)	0.4987 (62.5721%)	0.5238 (73.6502%)	0.545 (103.8095%)	0.2152 (62.9608%)	0.3775 (71.8227%)	0.58 (78.3784%)	0.3815 (107.9513%)	0.6711 (94.4546%)	0.6863 (83.3394%)
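The percentages in parentheses appear to be the DeepSeek-R1-Distill-Qwen-7B score divided by the Qwen2.5-7B-Instruct baseline score. A minimal sketch that reproduces them for a few columns (the function name `relative_to_baseline` is made up for illustration; the numbers are copied from the table):

```python
# Scores copied from the table above (a subset of columns, for brevity).
baseline = {"cmmlu": 0.797, "mmlu": 0.7112, "paws_zh": 0.525, "xnli_zh": 0.3534}
distill = {"cmmlu": 0.4987, "mmlu": 0.5238, "paws_zh": 0.545, "xnli_zh": 0.3815}


def relative_to_baseline(score, base):
    """Return score as a percentage of the baseline score (hypothetical helper)."""
    return 100.0 * score / base


relative = {
    task: relative_to_baseline(distill[task], baseline[task]) for task in baseline
}

for task, pct in relative.items():
    # Values above 100% mean the distilled model beat the baseline on that task.
    print(f"{task}: {pct:.4f}%")
```

Under this reading, only paws_zh and xnli_zh exceed 100%, meaning those are the only tasks in the table where the distilled model scores higher than the baseline.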

mathCrazyy · 2025-02-05T12:33:28Z

Has anyone seen similar test results?

mathCrazyy · 2025-02-06T05:34:52Z

Model	alias	acc_norm	acc_norm_stderr	effective	original
DeepSeek-R1-Distill-Qwen-7B	leaderboard_gpqa_diamond	0.3232	0.0333	198	198
Qwen2.5-7B-Instruct	leaderboard_gpqa_diamond	0.3333	0.0336	198	198
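As a rough check on whether the two GPQA scores above differ meaningfully, the sketch below recomputes the standard errors and compares the gap against them. It assumes the reported `acc_norm_stderr` is the standard error of a mean of 198 binary outcomes with an n-1 variance estimate (this formula matches the reported values, but it is an assumption about how they were produced), and it treats the two runs as independent samples, which is a simplification.

```python
import math


def sample_stderr(acc, n):
    """Standard error of the mean of n 0/1 outcomes, using the n-1 variance estimate."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


n = 198
acc_distill, acc_qwen = 0.3232, 0.3333  # acc_norm values from the table above

se_distill = sample_stderr(acc_distill, n)
se_qwen = sample_stderr(acc_qwen, n)

# Assumption: treat the two runs as independent. Both models saw the same 198
# questions, so a paired test on per-item results would be more appropriate.
diff = acc_qwen - acc_distill
se_diff = math.sqrt(se_distill**2 + se_qwen**2)
z = diff / se_diff

print(f"stderr: {se_distill:.4f} (distill), {se_qwen:.4f} (qwen)")
print(f"gap = {diff:.4f}, z = {z:.2f}")
```

With only 198 questions, a gap of about 0.01 is far smaller than one standard error of the difference, so this table alone does not separate the two models.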

mathCrazyy · 2025-02-06T14:24:12Z

When I use https://tiger-ai-lab.github.io/CritiqueFineTuning/, I get the following results:

Benchmark	Samples	Qwen2.5-7B-Instruct	DeepSeek-R1-Distill-Qwen-7B
math-500	500	77.4	57.6, 62.6, 73.6, 75, 75.8, 76.8
aime24	30	13.3	6.7, 13.3, 20
gpqa	198	0.4090	0.4343

However, there is still a big gap compared with the results reported in the paper: math-500 is 92.8 for DeepSeek-R1-Distill-Qwen-7B. My system prompt is:
"deepseek-math_76.8": ( """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process is enclosed within <think> </think> and the answer put in \\boxed{{}}. tags, respectively. User:{input}\nAssistant:""", "{output}", "\n\n\n", ),

What should I do? Thank you.
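For readers unfamiliar with the template syntax in the prompt above: `{input}` looks like a `str.format` placeholder, and `{{}}` is the escape that yields literal braces after formatting. The sketch below assumes the evaluation harness fills the template with `str.format` (not confirmed in this thread) and uses a made-up question; the other two items of the tuple are omitted.

```python
# Template text copied from the comment above. In Python source, "\\boxed{{}}"
# is the string \boxed{{}}, which str.format turns into \boxed{}.
template = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process is "
    "enclosed within <think> </think> and the answer put in \\boxed{{}}. tags, "
    "respectively. User:{input}\nAssistant:"
)

# Hypothetical question, used only to show the substitution.
prompt = template.format(input="What is 1 + 1?")
print(prompt)
```

Printing the formatted prompt before a run is a cheap way to confirm that the model sees `\boxed{}` rather than doubled braces.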

mathCrazyy · 2025-02-10T02:28:48Z

Setting the max length to 16384 fixes it: math-500 rises to 86.8%, aime24 to 50%, and gpqa to 47.47%.
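One plausible reason the length limit matters so much: a reasoning model writes a long `<think>` section before its final `\boxed{...}` answer, so a short limit can cut the output off before any answer appears. The sketch below is an illustration only, not the scoring code used in this thread; `extract_boxed` is a hypothetical helper, and slicing by characters stands in for a token limit.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent or unclosed."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # brace nesting depth inside \boxed{...}
    content = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(ch)
        i += 1
    return None  # text ended before the closing brace


# Made-up model output: long reasoning, then the final answer.
full = "<think>" + "step " * 50 + "</think> The answer is \\boxed{\\frac{1}{2}}."
truncated = full[:100]  # simulates a generation limit that is too small

print(extract_boxed(full))
print(extract_boxed(truncated))
```

When the output is cut off inside the reasoning, no answer can be extracted and the item is scored as wrong even if the model would have reached the correct result.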

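mathCrazyy changed the title ~~Comparing models with lm-eval on 10% of the open-source data: is this performance normal?~~ Comparing models with lm-eval on 10% of the open-source data: are these performance results normal? Feb 5, 2025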

mathCrazyy closed this as completed Feb 10, 2025
