Latest TRL code = significantly worse rewards for GRPO training #2731

abacaj · 2025-02-02T01:18:21Z

Updated

I tried the code from the initial vLLM commit (ed14ed9) and it works as expected with vLLM on or off, so this appears to be a more recent issue.

Reproduction

I initially thought it was because of vLLM inference, but then I tried three runs on the latest commit @ a325a0e: two without vLLM and one with vLLM (the only change was use_vllm=True/False). All three runs gave worse rewards than an older TRL commit.

Reverting to the TRL library at commit @ 4659ad9 works much better (this commit doesn't have vLLM yet). Note that the only thing I am changing between runs is the version of the TRL library; the training code is exactly the same.

In the reward plot, the dark line is the run on commit @ 4659ad9; all other runs are on the latest commit (a325a0e).
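For anyone trying to reproduce the comparison, here is a minimal sketch of how the three commits mentioned in this issue could be installed one at a time. It only builds and prints the pip commands; it does not run them. The repository URL, the helper name `pip_install_command`, and the `--force-reinstall` flag are assumptions for illustration and are not taken from the original report. pip may also need the full commit hash rather than the short ones quoted here.

```python
# Hypothetical helper (not part of TRL): build pip commands that install
# TRL at each commit named in this issue, so the same training script can
# be re-run against each version.

# Assumed upstream repository URL; adjust if you use a fork.
TRL_REPO = "https://github.com/huggingface/trl.git"

# Short hashes exactly as quoted in the issue, with the reported outcome.
COMMITS = {
    "4659ad9": "older commit, no vLLM support yet, good rewards",
    "ed14ed9": "initial vLLM commit, works as expected",
    "a325a0e": "latest commit at time of report, worse rewards",
}


def pip_install_command(commit: str, repo: str = TRL_REPO) -> str:
    """Return a pip command that installs `repo` at the given git ref."""
    return f"pip install --force-reinstall git+{repo}@{commit}"


commands = [pip_install_command(commit) for commit in COMMITS]

for commit, command in zip(COMMITS, commands):
    # Print the reported outcome as a comment above each command.
    print(f"# {COMMITS[commit]}")
    print(command)
```

Running the same training script after each install, with everything else held fixed, mirrors the comparison described above.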

System Info

local 3090s

Checklist

I have checked that my issue isn't already filed (see open issues)
I have included my system information
Any code provided is minimal, complete, and reproducible (more on MREs)
Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
Any traceback provided is complete


qgallouedec · 2025-02-02T07:39:24Z

It's not very clear. Do you use vllm or not?

abacaj · 2025-02-02T07:43:57Z

> It's not very clear. Do you use vllm or not?

The latest commit doesn't produce the expected results with or without vLLM. This commit works: ed14ed9

abacaj changed the title ~~Latest TRL codd = significantly worse rewards for GRPO training (not clear what changed)~~ Latest TRL code = significantly worse rewards for GRPO training (not clear what changed) Feb 2, 2025

abacaj changed the title ~~Latest TRL code = significantly worse rewards for GRPO training (not clear what changed)~~ Latest TRL code = significantly worse rewards for GRPO training Feb 2, 2025

github-actions bot added 🏋 GRPO Related to GRPO 🐛 bug Something isn't working labels Feb 2, 2025
