Update: I tried the code from the initial vLLM commit (ed14ed9) and it works as expected with vLLM on or off, so this appears to be a regression introduced after that commit.
Reproduction
I initially thought the cause was vLLM inference, but then I tried three runs on the latest commit @ a325a0e: two without vLLM and one with vLLM (the only change was use_vllm=True/False). All three runs gave worse rewards than an older TRL commit.
Reverting to the TRL library at commit @ 4659ad9 works much better (this commit doesn't have vLLM yet). Note that the only thing I am changing between runs is the version of the TRL library; the training code is exactly the same.
The dark line is the run on commit @ 4659ad9; all other runs are on the latest commit (a325a0e).
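For anyone who wants to repeat the comparison, here is a minimal sketch of the run matrix described above. It is not my training script. It only lists which TRL commit to install for each run and which use_vllm value to pass. The repo URL, the helper names (pip_spec, run_matrix), and the assumption that pip accepts these abbreviated hashes are mine; you may need the full commit hashes.

```python
# Hypothetical helper for reproducing the comparison: one entry per
# (TRL commit, use_vllm) combination mentioned in this report.

TRL_REPO = "https://github.com/huggingface/trl.git"  # assumed upstream repo URL

# Commits named in this issue; the value says whether that commit has vLLM support.
COMMITS = {
    "4659ad9": False,  # older commit, no vLLM yet (good rewards)
    "ed14ed9": True,   # initial vLLM commit (works as expected)
    "a325a0e": True,   # latest commit when this was filed (worse rewards)
}


def pip_spec(commit: str) -> str:
    """Build a pip requirement string that pins TRL to one commit."""
    return f"git+{TRL_REPO}@{commit}"


def run_matrix() -> list[dict]:
    """List every run to perform; use_vllm=True only where the commit supports it."""
    runs = []
    for commit, has_vllm in COMMITS.items():
        options = (False, True) if has_vllm else (False,)
        for use_vllm in options:
            runs.append(
                {
                    "commit": commit,
                    "use_vllm": use_vllm,
                    "install": f"pip install {pip_spec(commit)}",
                }
            )
    return runs


if __name__ == "__main__":
    for run in run_matrix():
        print(f"{run['install']}  # then train with use_vllm={run['use_vllm']}")
```

Everything else (model, dataset, reward functions, hyperparameters, seed) should stay fixed across runs so that the TRL version is the only variable.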
System Info
local 3090s
Checklist
I have checked that my issue isn't already filed (see open issues)
I have included my system information
Any code provided is minimal, complete, and reproducible (more on MREs)
Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
Any traceback provided is complete
abacaj changed the title from "Latest TRL codd = significantly worse rewards for GRPO training (not clear what changed)" to "Latest TRL code = significantly worse rewards for GRPO training (not clear what changed)" on Feb 2, 2025
abacaj changed the title from "Latest TRL code = significantly worse rewards for GRPO training (not clear what changed)" to "Latest TRL code = significantly worse rewards for GRPO training" on Feb 2, 2025