Question about reward function in GRPO example #2771
Labels: 🏋 GRPO, ❓ question, 🏋 Reward
https://huggingface.co/docs/trl/main/en/grpo_trainer
Am I misunderstanding something about the reward function in the documentation example?
Looking at the current example:
I think this reward function might be working opposite to what we want. Since GRPO tries to maximize the reward, wouldn't this make the model generate text that's far from 20 characters?
Shouldn't we use something like:
or
so that completions closer to 20 characters get higher rewards?
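To make the concern concrete, here is a minimal sketch. It assumes the documented reward has the form `abs(20 - len(completion))`; that form and all three function names below are illustrative assumptions, not text copied from the docs. The `(completions, **kwargs)` signature follows the custom reward function convention used by `GRPOTrainer`. The two alternatives show ways to give completions nearer 20 characters the higher reward.

```python
# Assumed form of the reward being questioned: the value grows as the
# completion length moves away from 20, so maximizing it pushes lengths
# away from the target.
def reward_len_distance(completions, **kwargs):
    return [abs(20 - len(completion)) for completion in completions]


# Alternative 1 (hypothetical name): negate the distance, so the best
# possible reward is 0 at exactly 20 characters.
def reward_len_negated(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]


# Alternative 2 (hypothetical name): a bounded reward in (0, 1] that
# peaks at exactly 20 characters.
def reward_len_inverse(completions, **kwargs):
    return [1.0 / (1.0 + abs(20 - len(completion))) for completion in completions]


if __name__ == "__main__":
    # Three dummy completions of length 5, 20 and 40.
    samples = ["a" * 5, "a" * 20, "a" * 40]
    for fn in (reward_len_distance, reward_len_negated, reward_len_inverse):
        print(fn.__name__, fn(samples))
```

Under this assumption, the first function gives its lowest reward to the 20-character completion, while both alternatives give it the highest, which is the behavior the question is asking for.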
I would appreciate it if someone could clarify this.