
[not bug] How to distill Qwen2.5 7B to Qwen2.5 3B #2416

Open

whk6688 opened this issue Feb 20, 2025 · 1 comment
whk6688 commented Feb 20, 2025

I want to distill Qwen2.5 7B into Qwen2.5 3B, but the two models have different vocab sizes. For now I am cropping the teacher tensor (teacher_logits[:, :151936]). Is there a better way to solve this?

thanks
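For reference, a minimal sketch of the cropping approach described in the question, assuming PyTorch and the usual [batch, seq, vocab] logit shapes; the function name and temperature handling are illustrative and not taken from any particular recipe:

```python
import torch
import torch.nn.functional as F

def kd_loss_with_cropped_teacher(
    teacher_logits: torch.Tensor,  # e.g. [batch, seq, 152064] for Qwen2.5 7B
    student_logits: torch.Tensor,  # e.g. [batch, seq, 151936] for Qwen2.5 3B
    temperature: float = 1.0,
) -> torch.Tensor:
    """Forward-KL distillation loss after cropping the teacher's padded vocab dim."""
    # Crop the teacher's vocab dimension to match the student's. The extra
    # columns are embedding padding that no real token id ever maps to.
    student_vocab = student_logits.size(-1)
    teacher_logits = teacher_logits[..., :student_vocab]

    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Cross-entropy of the student under the teacher distribution
    # (equals forward KL up to the teacher's entropy, which is constant w.r.t. the student).
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```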

@ebsmothers (Contributor)
Hi @whk6688, thanks for creating the issue. In general it's recommended to distill only between models with the same vocab size; however, in this case I believe you can get away with the approach you're suggesting. See the following excerpt from this comment:

You can always resize the embedding, as the index over 151646 is meaningless.

So based on that I think your proposed approach should work.
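As a sanity check (not from the thread), one could confirm that the tokenizer's highest token id sits below the smaller padded embedding size, so the cropped columns really are unused padding. This sketch assumes both models share the Qwen2.5 tokenizer and uses the Hugging Face model id only as an example:

```python
from transformers import AutoTokenizer

# Assumption: 7B and 3B share the same Qwen2.5 tokenizer; only the embedding /
# output matrices are padded to different sizes (152064 vs 151936).
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
max_id = max(tok.get_vocab().values())
print(f"highest token id: {max_id}")

# If this holds, teacher_logits[..., :151936] drops only padding columns
# that no token can ever be assigned to.
assert max_id < 151936
```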

joecummings added the triaged label Feb 25, 2025