Hey friends! I have some questions on the GRPO implementation, happy to discuss.
It looks like you apply the KL divergence inside the advantages, whereas the DeepSeekMath paper says: "Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of Â."
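For concreteness, here is a minimal self-contained sketch of the two placements. Tensor names (`per_token_logps`, `ref_per_token_logps`, `advantages`, `beta`) and shapes are assumptions for illustration, not the trainer's actual internals:

```python
import torch

B, T = 2, 5                              # assumed: 2 completions, 5 tokens each
per_token_logps = torch.randn(B, T)      # assumed: log-probs under the trained policy
ref_per_token_logps = torch.randn(B, T)  # assumed: log-probs under the reference policy
advantages = torch.randn(B)              # assumed: group-normalized Â per completion
beta = 0.04                              # hypothetical KL coefficient

# Unbiased per-token KL estimator (Schulman's k3): exp(q - p) - (q - p) - 1
diff = ref_per_token_logps - per_token_logps
per_token_kl = torch.exp(diff) - diff - 1

# Option A (PPO-style, what the paper says GRPO avoids): fold the penalty
# into the per-token reward before it is normalized into Â:
#   reward_t <- reward_t - beta * per_token_kl

# Option B (what the paper describes): leave Â untouched and add the KL
# term directly to the per-token loss.
ratio = torch.exp(per_token_logps - per_token_logps.detach())  # equals 1 in value, keeps gradients
per_token_loss = -(ratio * advantages.unsqueeze(1) - beta * per_token_kl)
loss = per_token_loss.mean()
```

Under Option B the gradient of the KL term flows into the loss separately, so Â stays a pure function of the group-normalized rewards.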
Also, did any thought go into making this a sum over the loss rather than a mean? We aren't sure about the motivation here.
(Relevant code: trl/trl/trainer/grpo_trainer.py, line 286 at commit fe4b5ef.)
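To make the aggregation question concrete, here is a hedged sketch of the two choices (tensor names and shapes are assumptions, not the trainer's actual code):

```python
import torch

per_token_loss = torch.randn(2, 5)   # dummy per-token losses (assumed shape)
completion_mask = torch.ones(2, 5)   # 1 where a real completion token exists

# "sum" style: pool every valid token into one global mean, so longer
# completions contribute more tokens to the gradient
loss_token_mean = (per_token_loss * completion_mask).sum() / completion_mask.sum()

# "mean" style: average within each sequence first, then across the batch,
# so every completion is weighted equally regardless of its length
loss_seq_mean = ((per_token_loss * completion_mask).sum(dim=1)
                 / completion_mask.sum(dim=1)).mean()
```

The two differ only in how completions of different lengths are weighted, but that weighting can change the effective learning rate per sample.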