Question about reward function in GRPO example #2771

junuMoon · 2025-02-05T11:57:10Z

https://huggingface.co/docs/trl/main/en/grpo_trainer

Am I misunderstanding something about the reward function in the documentation example?

Looking at the current example:

def reward_len(completions, **kwargs):
    return [abs(20 - len(completion)) for completion in completions]

I think this reward function might be working opposite to what we want. Since GRPO tries to maximize the reward, wouldn't this make the model generate text that's far from 20 characters?

Shouldn't we use something like:

def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

or

def reward_len(completions, **kwargs):
    return [1 / (1 + abs(20 - len(completion))) for completion in completions]

so that completions closer to 20 characters get higher rewards?

I would appreciate it if someone could clarify this.
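A quick way to check the concern raised here is to score a few completions of known length with each variant and see which length gets the highest reward. The sketch below is illustrative only: the three function names, the helper `best_length`, and the sample completions are hypothetical, and the reward bodies are copied from the snippets in this comment.

```python
def reward_original(completions, **kwargs):
    # Docs example as quoted above: reward grows with distance from 20 chars.
    return [abs(20 - len(c)) for c in completions]


def reward_negated(completions, **kwargs):
    # First proposed fix: reward is 0 at exactly 20 chars, negative elsewhere.
    return [-abs(20 - len(c)) for c in completions]


def reward_inverse(completions, **kwargs):
    # Second proposed fix: reward is 1 at exactly 20 chars, decays toward 0.
    return [1 / (1 + abs(20 - len(c))) for c in completions]


# Hypothetical completions of length 5, 20, and 60 characters.
completions = ["a" * 5, "a" * 20, "a" * 60]


def best_length(reward_fn):
    """Return the length of the completion that receives the highest reward."""
    rewards = reward_fn(completions)
    best = max(range(len(completions)), key=lambda i: rewards[i])
    return len(completions[best])


print(best_length(reward_original))  # 60
print(best_length(reward_negated))   # 20
print(best_length(reward_inverse))   # 20
```

With the original function the 60-character completion scores highest, so a trainer that maximizes reward would be pushed away from 20 characters. Both proposed variants rank the 20-character completion first.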

qgallouedec · 2025-02-05T13:54:38Z

Ah yes, you are right. Do you want to correct it?

cfpark00 · 2025-02-05T15:48:54Z

#2714

cfpark00 · 2025-02-05T15:49:17Z

will close this as well when addressed!

junuMoon · 2025-02-05T16:14:27Z

Is it okay if I open a PR for this?

susumuota · 2025-02-05T17:16:54Z

Thank you so much! I have the same problem.

I'm trying this reward function.

import numpy as np

def reward_len(completions, **kwargs):
    return [np.exp(-0.1 * abs(len(completion) - 20)) for completion in completions]
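To see how an exponential reward like this one behaves within a group of samples, here is a minimal sketch. It assumes advantages are computed by normalizing rewards within the group (subtract the group mean, divide by the group standard deviation), which is how GRPO is usually described. The helper `group_advantages`, the `eps` value, and the sample completions are hypothetical and are not taken from TRL's source; `math.exp` stands in for `np.exp` so the example has no third-party dependency.

```python
import math
import statistics


def reward_len(completions, **kwargs):
    # Same shape as the function above: 1.0 at 20 chars, decaying toward 0.
    return [math.exp(-0.1 * abs(len(c) - 20)) for c in completions]


def group_advantages(rewards, eps=1e-4):
    # Assumed normalization: (reward - group mean) / (group std + eps).
    # eps is a hypothetical guard against division by zero.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of completions with 5, 20, and 60 characters.
completions = ["a" * n for n in (5, 20, 60)]
rewards = reward_len(completions)
advantages = group_advantages(rewards)

for c, r, a in zip(completions, rewards, advantages):
    print(f"len={len(c):2d}  reward={r:.3f}  advantage={a:+.3f}")
```

Under this assumption, only the 20-character completion gets a positive advantage, and the 60-character completion gets the most negative one, which is the direction the thread is asking for.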

github-actions bot added the 🏋 GRPO, 🏋 Reward, and ❓ question labels on Feb 5, 2025

junuMoon mentioned this issue on Feb 6, 2025 in 🙃 Fix reward function in GRPO example #2777 (merged)

qgallouedec closed this as completed in #2777 Feb 6, 2025
