
Question about reward function in GRPO example #2771

Closed
junuMoon opened this issue Feb 5, 2025 · 5 comments · Fixed by #2777
Labels
🏋 GRPO Related to GRPO ❓ question Seeking clarification or more information 🏋 Reward Related to Reward modelling

Comments

@junuMoon (Contributor) commented Feb 5, 2025

https://huggingface.co/docs/trl/main/en/grpo_trainer

Am I misunderstanding something about the reward function in the documentation example?

Looking at the current example:

def reward_len(completions, **kwargs):
    return [abs(20 - len(completion)) for completion in completions]

I think this reward function might be doing the opposite of what we want. Since GRPO tries to maximize the reward, wouldn't this make the model generate text whose length is far from 20 characters?

Shouldn't we use something like:

def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

or

def reward_len(completions, **kwargs):
    return [1 / (1 + abs(20 - len(completion))) for completion in completions]

so that completions closer to 20 characters get higher rewards?
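
For concreteness, here's a quick sanity check (plain Python; the function names just mirror the snippets above) showing how each candidate scores a 20-character completion versus a 50-character one. GRPO maximizes reward, so the 20-character completion should get the higher value:

def reward_current(completion):      # as in the current docs example
    return abs(20 - len(completion))

def reward_negative(completion):     # first proposed fix
    return -abs(20 - len(completion))

def reward_inverse(completion):      # second proposed fix
    return 1 / (1 + abs(20 - len(completion)))

good, bad = "x" * 20, "x" * 50

print(reward_current(good), reward_current(bad))    # 0 vs 30   -> the 50-char completion wins (wrong)
print(reward_negative(good), reward_negative(bad))  # 0 vs -30  -> the 20-char completion wins
print(reward_inverse(good), reward_inverse(bad))    # 1.0 vs ~0.03 -> the 20-char completion wins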

I would appreciate it if someone could clarify this.

@github-actions github-actions bot added 🏋 GRPO Related to GRPO 🏋 Reward Related to Reward modelling ❓ question Seeking clarification or more information labels Feb 5, 2025
@qgallouedec (Member)

Ah yes, you are right. Do you want to correct it?

@cfpark00 commented Feb 5, 2025

#2714

@cfpark00 commented Feb 5, 2025

Will close this as well when addressed!

@junuMoon (Contributor, Author) commented Feb 5, 2025

Is it okay if I make a PR for this?

@susumuota commented Feb 5, 2025

Thank you so much! I have the same problem.

I'm trying this reward function.

import numpy as np

def reward_len(completions, **kwargs):
    # Exponential decay: reward is 1.0 at exactly 20 characters and falls off smoothly
    return [np.exp(-0.1 * abs(len(completion) - 20)) for completion in completions]
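
For anyone following along, any of the reward functions in this thread plugs into the trainer the same way as in the linked docs page; the model name and dataset below are just the placeholders used there, not requirements:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,   # any of the reward functions discussed above
    args=training_args,
    train_dataset=dataset,
)
trainer.train()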
