
GRPO questions #2608

Open
natolambert opened this issue Jan 22, 2025 · 0 comments
Labels
🏋 GRPO Related to GRPO ❓ question Seeking clarification or more information

Comments

@natolambert (Contributor)

Hey friends! I have some questions about the GRPO implementation; happy to discuss.

  1. It looks like you fold the KL penalty into the advantages, while the DeepSeekMath paper says: “Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of Â.”
  2. Did any thought go into making this a sum over the loss rather than a mean? We aren’t sure about this line:
    loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
  3. I didn’t see the PPO clipping logic in the policy-gradient loss; is that coming soon? (A sketch of the clipped objective with the KL added to the loss is below this list.)
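
For reference, here is a minimal, hedged sketch of the objective the paper describes: a PPO-style clipped surrogate with the KL divergence to the reference policy added directly to the per-token loss (via the unbiased k3 estimator), then normalized exactly as in the line quoted in point 2. This is not the TRL implementation; all tensor and argument names (`per_token_logps`, `old_per_token_logps`, `ref_per_token_logps`, `advantages`, `completion_mask`, `beta`, `epsilon`) are illustrative placeholders.

```python
import torch

def grpo_loss_sketch(
    per_token_logps,      # (B, T) current-policy log-probs of sampled tokens
    old_per_token_logps,  # (B, T) log-probs of the policy that generated the samples
    ref_per_token_logps,  # (B, T) log-probs of the frozen reference policy
    advantages,           # (B,)   group-normalized advantage per completion
    completion_mask,      # (B, T) 1 for completion tokens, 0 for padding
    beta=0.04,            # KL coefficient (illustrative value)
    epsilon=0.2,          # PPO clipping range (illustrative value)
):
    # PPO-style clipped surrogate on the importance ratio (point 3).
    ratio = torch.exp(per_token_logps - old_per_token_logps)
    adv = advantages.unsqueeze(1)
    per_token_pg_loss = -torch.min(
        ratio * adv,
        torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * adv,
    )

    # KL added directly to the loss via the k3 estimator,
    # rather than folded into the reward/advantage (point 1).
    log_ratio = ref_per_token_logps - per_token_logps
    per_token_kl = torch.exp(log_ratio) - log_ratio - 1

    per_token_loss = per_token_pg_loss + beta * per_token_kl

    # Per-sequence mean over valid tokens, then mean over the batch,
    # matching the normalization quoted in point 2.
    return ((per_token_loss * completion_mask).sum(dim=1)
            / completion_mask.sum(dim=1)).mean()
```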
@github-actions github-actions bot added 🏋 PPO Related to PPO ❓ question Seeking clarification or more information labels Jan 22, 2025
@August-murr August-murr added 🏋 GRPO Related to GRPO and removed 🏋 PPO Related to PPO labels Jan 23, 2025