Latest TRL code = significantly worse rewards for GRPO training #2731

Open
5 tasks done
abacaj opened this issue Feb 2, 2025 · 2 comments
Labels
🐛 bug Something isn't working 🏋 GRPO Related to GRPO

Comments


abacaj commented Feb 2, 2025

Updated

So I tried the code from the initial vLLM commit (ed14ed9), and it works as expected with vLLM on or off. The problem therefore appears to come from a more recent change.

Reproduction

I initially thought vLLM inference was the cause, but then I tried three runs on the latest commit @ a325a0e: two without vLLM and one with vLLM (the only change was use_vllm=True/False). All three runs gave worse rewards than an older TRL commit.
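For reference, the vLLM on/off comparison can be sketched as two sets of configuration keyword arguments that differ only in `use_vllm` (and the output directory). This is a minimal sketch, not the actual training script: the helper name, the `seed` value, and the directory names are hypothetical, and the dictionaries are meant to be passed to `trl.GRPOConfig(**kwargs)`, which is left out so the sketch runs without TRL installed.

```python
def grpo_run_kwargs(output_root="grpo-runs", seed=0):
    """Build kwargs for two runs that differ only in use_vllm.

    Hypothetical helper: names and values are placeholders, not the
    settings used for the runs reported in this issue.
    """
    shared = {"seed": seed}  # everything else stays identical across runs
    runs = {}
    for name, use_vllm in (("vllm_off", False), ("vllm_on", True)):
        runs[name] = {
            **shared,
            "output_dir": f"{output_root}/{name}",
            "use_vllm": use_vllm,  # the only intended difference
        }
    return runs


if __name__ == "__main__":
    for name, kwargs in grpo_run_kwargs().items():
        # With TRL installed: config = trl.GRPOConfig(**kwargs)
        print(name, kwargs)
```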

Reverting to the TRL library at commit 4659ad9 works much better (this commit doesn't have vLLM yet). Note that the only thing I am changing between runs is the version of the TRL library; the training code is exactly the same.
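One way to switch between TRL versions like this is to install the library straight from a Git commit with pip. The sketch below only builds the install command. The helper name is hypothetical, and it assumes the short hashes from this issue resolve in the `huggingface/trl` repository (a full hash is the safer choice).

```python
import sys

TRL_REPO = "https://github.com/huggingface/trl.git"


def pip_install_trl_at(commit):
    """Return a pip command that installs TRL at a given commit."""
    return [sys.executable, "-m", "pip", "install", f"git+{TRL_REPO}@{commit}"]


if __name__ == "__main__":
    # Commits named in this issue: 4659ad9 (older) and a325a0e (latest at the time).
    for commit in ("4659ad9", "a325a0e"):
        print(" ".join(pip_install_trl_at(commit)))
```

Each command can be run in a fresh virtual environment so that the two installs do not interfere with each other.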

Dark line run is commit @ 4659ad9, all other runs are on the latest commit as mentioned.

Image

System Info

local 3090s

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
  • Any traceback provided is complete
abacaj changed the title from "Latest TRL codd = significantly worse rewards for GRPO training (not clear what changed)" to "Latest TRL code = significantly worse rewards for GRPO training (not clear what changed)" on Feb 2, 2025
abacaj changed the title from "Latest TRL code = significantly worse rewards for GRPO training (not clear what changed)" to "Latest TRL code = significantly worse rewards for GRPO training" on Feb 2, 2025
github-actions bot added the 🏋 GRPO and 🐛 bug labels on Feb 2, 2025

qgallouedec (Member) commented

It's not very clear. Do you use vllm or not?

abacaj (Author) commented Feb 2, 2025

> It's not very clear. Do you use vllm or not?

The latest commit doesn't seem to produce the expected results with or without vLLM. This commit works: ed14ed9
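Since there is a known-good commit (ed14ed9) and a known-bad one (a325a0e), one possible way to narrow down the change is `git bisect`. The sketch below only builds the command sequence. It assumes a local clone of TRL in which the good commit is an ancestor of the bad one, and `check_rewards.py` is a hypothetical script that exits 0 when rewards look as expected and 1 otherwise.

```python
def bisect_commands(good, bad, check_cmd):
    """Return git commands that bisect between a good and a bad commit.

    check_cmd is a hypothetical check that exits 0 for a good commit
    and 1 for a bad one; git bisect run uses that exit code.
    """
    return [
        ["git", "bisect", "start", bad, good],  # bad commit first, then good
        ["git", "bisect", "run", *check_cmd],   # test each candidate commit
        ["git", "bisect", "reset"],             # return to the original HEAD
    ]


if __name__ == "__main__":
    for cmd in bisect_commands("ed14ed9", "a325a0e", ["python", "check_rewards.py"]):
        print(" ".join(cmd))
```

Because every step needs a training run long enough to show the reward gap, this is only practical if that gap shows up early in training.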
