How many llama models are used for constructing llama-moe? Is the MoE built from multiple llama models or from a single one? #55
Comments
Hi there, thanks for your attention to this project ❤️
Thanks a lot! Have you considered setting up a WeChat group later so everyone can discuss MoE together?
Hello, I have a few other questions to ask.
We tested freezing the other parameters first and pre-training only the gates. However, as more tokens were consumed during continual pre-training, the two-stage approach showed no advantage, so we kept things simple and trained the whole model without any special gating tricks.
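For reference, a minimal PyTorch sketch of the first stage described above (freeze everything, train only the gates), using a toy stand-in module; identifying router parameters by a "gate" substring in their names is an assumption that depends on the actual model implementation:

```python
import torch.nn as nn

# Toy stand-in for an MoE block; the real LLaMA-MoE model would be loaded instead.
class ToyMoE(nn.Module):
    def __init__(self, hidden=32, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))

model = ToyMoE()

# Stage 1: keep only the gate/router parameters trainable, freeze the rest.
for name, param in model.named_parameters():
    param.requires_grad_("gate" in name)

print([n for n, p in model.named_parameters() if p.requires_grad])
```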
Roughly how many tokens does it take for the two approaches to become essentially equivalent? If the gates are not handled specially, at about what token count does the loss drop to a reasonable level? My loss is currently around 4.x, and the gradient norm is in the several-thousand range and still rising. Based on previous experience, gradients that large seem wrong.
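As a side note on diagnosing gradients of that size, a minimal self-contained sketch of logging the global gradient norm with PyTorch's `clip_grad_norm_` (the toy model, data, and `max_norm` value are placeholders):

```python
import torch

# Toy setup just to make the snippet self-contained; substitute the real MoE model.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(4, 16)
loss = model(x).pow(2).mean()
loss.backward()

# clip_grad_norm_ returns the total norm computed *before* clipping,
# so it doubles as a way to log and spot abnormally large gradients.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"grad norm before clipping: {total_norm:.2f}")
optimizer.step()
optimizer.zero_grad()
```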
Hi there~ The multi-stage pre-training comparison took about 20B tokens. Reaching a relatively low loss value (2.1) may take about 20-30B tokens. However, spending 20B tokens just on gate pre-training may not be an efficient training recipe (the loss converges within 5-10B tokens), so you could try different settings to find a better one.
Is the MoE model constructed from multiple llama models or from a single llama model?
Is the purpose of this repo to split the FFN layers of a single llama model into multiple FFNs, using different splitting methods, so that they act as multiple experts, and then to combine the remaining layers and weights of that llama model with the split FFNs and gates to form an MoE model?
Is it also supported to merge the FFN layers of multiple llama-architecture models and build the MoE on top of a single base llama model structure?
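To illustrate the single-model construction asked about above, here is a rough sketch that partitions one dense FFN's intermediate neurons into several expert FFNs and adds a fresh gate. The even contiguous split, the two-projection FFN (LLaMA actually uses a three-projection SwiGLU), and all names are simplifications for illustration, not the repo's actual implementation:

```python
import torch.nn as nn

def split_ffn_into_experts(up: nn.Linear, down: nn.Linear, num_experts: int):
    """Partition the intermediate neurons of a dense FFN (up -> activation -> down)
    into `num_experts` smaller expert FFNs. An even contiguous split is used here
    for brevity; dedicated splitting methods (random, clustering, etc.) would differ."""
    size = up.out_features // num_experts
    experts = []
    for i in range(num_experts):
        sl = slice(i * size, (i + 1) * size)
        e_up = nn.Linear(up.in_features, size, bias=False)
        e_down = nn.Linear(size, down.out_features, bias=False)
        e_up.weight.data = up.weight.data[sl].clone()         # rows = intermediate neurons
        e_down.weight.data = down.weight.data[:, sl].clone()  # columns = intermediate neurons
        experts.append(nn.Sequential(e_up, nn.SiLU(), e_down))
    # The gate/router is new and randomly initialized; continual pre-training must learn it.
    gate = nn.Linear(up.in_features, num_experts, bias=False)
    return nn.ModuleList(experts), gate

# Toy dense FFN standing in for one llama FFN layer.
up = nn.Linear(32, 128, bias=False)
down = nn.Linear(128, 32, bias=False)
experts, gate = split_ffn_into_experts(up, down, num_experts=4)
print(len(experts), gate.weight.shape)  # 4 experts; gate maps hidden dim -> expert logits
```

The remaining attention layers and embeddings of the original model would be reused unchanged, so no extra llama checkpoints are needed in this single-model scheme.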