🙃 Fix reward function in GRPO example #2777

junuMoon · 2025-02-06T01:54:02Z

What does this PR do?

Changes the reward function in the GRPO example to return normalized values between 0 and 1, so that completions closer to the target length (20) get higher rewards.
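
To make the description concrete, here is a minimal sketch of one reward function with this behavior. It is not the code merged in this PR: the linear falloff, the constant name `TARGET_LENGTH`, and measuring length in characters are assumptions made for illustration. The `(completions, **kwargs)` signature follows the convention used for custom reward functions in the GRPO docs.

```python
TARGET_LENGTH = 20  # assumed to be measured in characters (hypothetical constant name)


def reward_len(completions, **kwargs):
    """Return one reward in [0, 1] per completion.

    A completion of exactly TARGET_LENGTH characters scores 1.0; the score
    falls off linearly with the distance from the target and is clipped at 0.
    """
    rewards = []
    for completion in completions:
        distance = abs(TARGET_LENGTH - len(completion))
        # Linear falloff (an assumption, not the PR's formula), clipped so
        # very long completions never receive a negative reward.
        rewards.append(max(0.0, 1.0 - distance / TARGET_LENGTH))
    return rewards


# Example: exactly on target, halfway there, and empty.
print(reward_len(["a" * 20, "a" * 10, ""]))  # [1.0, 0.5, 0.0]
```

Clipping at 0 keeps the rewards bounded even for completions longer than twice the target, which is one simple way to satisfy the "between 0 and 1" requirement.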

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@qgallouedec

docs/source/grpo_trainer.md

fix reward function in GRPO example

283e9ca

qgallouedec reviewed Feb 6, 2025

docs/source/grpo_trainer.md (outdated, resolved)

refactor reward function more simple

f5df03e

qgallouedec changed the title ~~fix reward function in GRPO example~~ 🙃 Fix reward function in GRPO example Feb 6, 2025

qgallouedec approved these changes Feb 6, 2025

qgallouedec merged commit e95f9fb into huggingface:main Feb 6, 2025
