
GRPO questions #2608

Open
natolambert opened this issue Jan 22, 2025 · 0 comments
Labels
🏋 GRPO Related to GRPO ❓ question Seeking clarification or more information

Comments

@natolambert (Contributor)

Hey friends! I have some questions about the GRPO implementation; happy to discuss.

  1. It looks like you fold the KL penalty into the advantages, while the DeepSeekMath paper says: “Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of Â.”
  2. Did any thought go into making this a sum over the loss rather than a mean? We aren’t sure about this line:
    loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
  3. I didn’t see the PPO clipping logic in the policy-gradient loss; is that coming soon? (A sketch of the clipped objective with the KL added to the loss is below this list.)
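
For reference, here is a minimal, hedged sketch of the objective the paper describes: a PPO-style clipped surrogate with the KL divergence to the reference policy added directly to the per-token loss (via the unbiased k3 estimator), then normalized exactly as in the line quoted in point 2. This is not the TRL implementation; all tensor and argument names (`per_token_logps`, `old_per_token_logps`, `ref_per_token_logps`, `advantages`, `completion_mask`, `beta`, `epsilon`) are illustrative placeholders.

```python
import torch

def grpo_loss_sketch(
    per_token_logps,      # (B, T) current-policy log-probs of sampled tokens
    old_per_token_logps,  # (B, T) log-probs of the policy that generated the samples
    ref_per_token_logps,  # (B, T) log-probs of the frozen reference policy
    advantages,           # (B,)   group-normalized advantage per completion
    completion_mask,      # (B, T) 1 for completion tokens, 0 for padding
    beta=0.04,            # KL coefficient (illustrative value)
    epsilon=0.2,          # PPO clipping range (illustrative value)
):
    # PPO-style clipped surrogate on the importance ratio (point 3).
    ratio = torch.exp(per_token_logps - old_per_token_logps)
    adv = advantages.unsqueeze(1)
    per_token_pg_loss = -torch.min(
        ratio * adv,
        torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * adv,
    )

    # KL added directly to the loss via the k3 estimator,
    # rather than folded into the reward/advantage (point 1).
    log_ratio = ref_per_token_logps - per_token_logps
    per_token_kl = torch.exp(log_ratio) - log_ratio - 1

    per_token_loss = per_token_pg_loss + beta * per_token_kl

    # Per-sequence mean over valid tokens, then mean over the batch,
    # matching the normalization quoted in point 2.
    return ((per_token_loss * completion_mask).sum(dim=1)
            / completion_mask.sum(dim=1)).mean()
```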
@github-actions github-actions bot added 🏋 PPO Related to PPO ❓ question Seeking clarification or more information labels Jan 22, 2025
@August-murr August-murr added 🏋 GRPO Related to GRPO and removed 🏋 PPO Related to PPO labels Jan 23, 2025