Comparison using lm-eval on 10% of the open-source data: are these performance results normal? #281

Closed
mathCrazyy opened this issue Feb 5, 2025 · 4 comments

mathCrazyy commented Feb 5, 2025

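| Model | Notes | agieval_logiqa_zh | arc_zh | cmmlu | mmlu | paws_zh | truthfulqa_zh_mc1 | truthfulqa_zh_mc2 | xcopa_zh | xnli_zh | xstorycloze_zh | xwinograd_zh |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | baseline | 0.6818 | 0.5128 | 0.797 | 0.7112 | 0.525 | 0.3418 | 0.5256 | 0.74 | 0.3534 | 0.7105 | 0.8235 |
| DeepSeek-R1-Distill-Qwen-7B | | 0.3636 (53.3294%) | 0.2991 (58.3268%) | 0.4987 (62.5721%) | 0.5238 (73.6502%) | 0.545 (103.8095%) | 0.2152 (62.9608%) | 0.3775 (71.8227%) | 0.58 (78.3784%) | 0.3815 (107.9513%) | 0.6711 (94.4546%) | 0.6863 (83.3394%) |

Percentages in parentheses are the score relative to the baseline.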
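For reference, the percentages in parentheses can be reproduced by dividing each DeepSeek-R1-Distill-Qwen-7B score by the corresponding baseline score. A minimal sketch using a subset of the tasks from the table (the helper name `relative_to_baseline` is only illustrative):

```python
# Scores copied from the table above (a subset of tasks, for brevity).
baseline = {"agieval_logiqa_zh": 0.6818, "cmmlu": 0.797, "mmlu": 0.7112, "paws_zh": 0.525}
distill = {"agieval_logiqa_zh": 0.3636, "cmmlu": 0.4987, "mmlu": 0.5238, "paws_zh": 0.545}


def relative_to_baseline(score, base):
    """Return `score` expressed as a percentage of `base`."""
    return 100.0 * score / base


relative = {
    task: relative_to_baseline(distill[task], baseline[task]) for task in baseline
}

for task, pct in relative.items():
    print(f"{task}: {pct:.4f}%")
```

Values above 100% (for example paws_zh) mean the distilled model scored higher than the baseline on that task.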
@mathCrazyy
Author

Has anyone gotten similar test results?

@mathCrazyy
Author

| Model | alias | acc_norm | acc_norm_stderr | effective | original |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-7B | leaderboard_gpqa_diamond | 0.3232 | 0.0333 | 198 | 198 |
| Qwen2.5-7B-Instruct | leaderboard_gpqa_diamond | 0.3333 | 0.0336 | 198 | 198 |
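A quick way to judge whether the 0.3232 vs 0.3333 gap means anything is to compare it with the reported standard errors. The sketch below assumes the stderr is the usual sample standard error of a mean of 0/1 outcomes (which appears consistent with the reported 0.0333 and 0.0336), and it treats the two runs as independent, which is a simplification since both models answer the same 198 questions.

```python
import math


def sample_stderr(p, n):
    """Standard error of the mean of n 0/1 outcomes with mean p (sample std, n - 1)."""
    return math.sqrt(p * (1.0 - p) / (n - 1))


n = 198  # number of gpqa_diamond questions, from the table above
acc_distill, acc_qwen = 0.3232, 0.3333

se_distill = sample_stderr(acc_distill, n)
se_qwen = sample_stderr(acc_qwen, n)

# z-score of the difference, treating the two runs as independent (an assumption).
z = (acc_qwen - acc_distill) / math.sqrt(se_distill**2 + se_qwen**2)

print(f"stderr distill={se_distill:.4f}, stderr qwen={se_qwen:.4f}, z={z:.2f}")
```

Under these assumptions the difference is well under one standard error, so on this benchmark the two models are not distinguishable from this run alone.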

@mathCrazyy
Author

mathCrazyy commented Feb 6, 2025

When I evaluate with https://tiger-ai-lab.github.io/CritiqueFineTuning/, I get the following results:

| Benchmark | Samples | Qwen2.5-7B-Instruct | DeepSeek-R1-Distill-Qwen-7B |
| --- | --- | --- | --- |
| math-500 | 500 | 77.4 | 57.6, 62.6, 73.6, 75, 75.8, 76.8 |
| aime24 | 30 | 13.3 | 6.7, 13.3, 20 |
| gpqa | 198 | 0.4090 | 0.4343 |

However, there is still a big gap compared with the results reported in the paper: math-500 is 92.8 for DeepSeek-R1-Distill-Qwen-7B. My system prompt is:
"deepseek-math_76.8": ( """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process is enclosed within <think> </think> and the answer put in \\boxed{{}}. tags, respectively. User:{input}\nAssistant:""", "{output}", "\n\n\n", ),

What should I do? Thank you.
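To see exactly what the model receives, it can help to render the template once and inspect it. The sketch below assumes the harness fills the template with Python's `str.format` (the doubled braces in `\\boxed{{}}` suggest this, but it is not confirmed in the thread); the sample question is made up.

```python
# Prompt template copied from the comment above (first element of the tuple only).
TEMPLATE = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process is enclosed within <think> </think> and the answer put in \\boxed{{}}. tags, respectively. User:{input}\nAssistant:"""

# Hypothetical question, only for illustration.
prompt = TEMPLATE.format(input="What is 1 + 1?")

# After formatting, "{{}}" collapses to "{}" and "\\boxed" is a single backslash.
print(prompt)
```

Printing the rendered prompt makes it easy to check that the `\boxed{}` instruction and the `User:` / `Assistant:` markers come out as intended.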

@mathCrazyy
Author

Setting the max length to 16384 fixes it: math-500 rises to 86.8%, aime24 to 50%, and gpqa to 47.47%.
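One plausible reading (an assumption, not confirmed in the thread) is that with a shorter limit many reasoning traces were cut off before reaching a final answer. A rough way to check this on saved completions is to count outputs that never close `</think>` or never produce a `\boxed{...}` answer. The helper names below are hypothetical and the sample completions are made up.

```python
def looks_truncated(completion: str) -> bool:
    """Heuristic: <think> was opened but never closed, or no \\boxed answer is present."""
    unclosed_think = "<think>" in completion and "</think>" not in completion
    no_final_answer = "\\boxed{" not in completion
    return unclosed_think or no_final_answer


def truncation_rate(completions):
    """Fraction of completions flagged by looks_truncated (0.0 for an empty list)."""
    if not completions:
        return 0.0
    return sum(looks_truncated(c) for c in completions) / len(completions)


# Made-up examples: one complete answer, one cut off mid-reasoning.
samples = [
    "<think>2 + 2 = 4</think> The answer is \\boxed{4}.",
    "<think>Let me consider the case where",
]

print(truncation_rate(samples))
```

If the rate is high at the smaller limit and drops at 16384, that would support the truncation explanation for the earlier gap.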
