Allow more flexibility in new PPOTrainer (aka PPOv2) #2576

Open
Benjoyo opened this issue Jan 16, 2025 · 1 comment
Benjoyo commented Jan 16, 2025

Feature request

Allow more flexibility and hackability in the PPOTrainer (and the RLOOTrainer, for that matter) by exposing more parameters and refactoring the train loop into smaller, overridable functions.

Motivation

I feel like the new PPOTrainer is much less flexible than the old one; as soon as you want to deviate from the common case it was built for, you are basically forced to copy-paste it and modify it in multiple places.

I'll use the tools examples (examples/research_projects/tools), which rely on TextEnvironment and are now completely broken, to illustrate this:

  • reward_model as nn.Module is mandatory (see [question] best way to have my own reward model which is backed by rules #2518)
    • often you don't need a reward model at all but just a simple function, like in the tools examples (see the reward-function sketch below)
  • train_dataset is mandatory
    • usually this will make sense, but not always! Data can also be (and sometimes must be) generated on the fly, like in the calculator example
  • GenerationConfig is hardcoded and no generation_kwargs are exposed, only response_length and temperature
    • users should have control over generation; e.g. in the tools example it is wasteful to generate a fixed number of tokens, instead we may want to stop right after the next tool call, for which stop_strings would help (an EOS token alone is not enough); see the generation sketch further below
  • no step-wise training
    • the train loop is one giant function; the old trainer had a ppo_trainer.step(queries, responses, rewards, masks) function that played nicely with the tools example and, more generally, allowed a flexible, custom training loop (see the loop sketch further below)
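
To make the reward point concrete, a rules-based reward for the tools use case is really just a plain function over decoded completions. Here is a minimal sketch; the `(prompts, completions) -> rewards` signature is only my guess at what a function-based interface could look like, not an existing TRL API:

```python
import torch

# Hypothetical rule-based reward: score decoded completions without any nn.Module.
# The (prompts, completions) -> tensor signature is an assumption, not an existing TRL API.
def rule_based_reward(prompts: list[str], completions: list[str]) -> torch.Tensor:
    rewards = []
    for completion in completions:
        # Toy rule for a calculator-style tool: reward a well-formed tool call, penalize the rest.
        rewards.append(1.0 if "<request>" in completion and "<call>" in completion else -1.0)
    return torch.tensor(rewards)
```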

I hope we can get some refactoring for more flexibility. The reward function topic already seems to be getting addressed, and exposing generation_kwargs should be trivial.
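
On the generation side, the kind of control I mean looks roughly like this. stop_strings is a regular transformers GenerationConfig/generate argument; whether the trainer would accept a full GenerationConfig or raw generation_kwargs is of course up for discussion:

```python
from transformers import GenerationConfig

# Sketch of the generation control users need; passing this into the new PPOTrainer
# (e.g. via a generation_config / generation_kwargs argument) is not possible today.
generation_config = GenerationConfig(
    max_new_tokens=256,       # an upper bound instead of always generating response_length tokens
    do_sample=True,
    temperature=0.7,
    stop_strings=["<call>"],  # stop right after the next tool call instead of running to the limit
)
# Note: transformers requires passing the tokenizer to generate() when stop_strings is set.
```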

I don't know how complex it would be to add a training step function for easy custom loops, or to make the default behavior overridable. But I feel changes in that direction are needed if TRL wants to be the default reinforcement learning library: it should be hackable and work seamlessly for more advanced use cases :)
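
For reference, the kind of custom loop the old API made possible looked roughly like this; the generate/step/log_stats calls follow the old PPOTrainer, nothing equivalent exists in the new one, and score() below is just a placeholder for custom reward logic:

```python
import torch

# Rough shape of a custom loop with the old step-based API (not available in the new PPOTrainer).
# Assumes ppo_trainer, tokenizer, dataloader and generation_kwargs are set up as in the old TRL examples.
for batch in dataloader:
    query_tensors = batch["input_ids"]

    # Generate rollouts however we like, e.g. via a TextEnvironment with tool calls.
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)

    # Compute rewards with arbitrary custom logic; no nn.Module required.
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
    rewards = [torch.tensor(score(text)) for text in texts]  # score() is a placeholder

    # One PPO optimization step on this batch, then log.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
```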

Your contribution

None right now, sorry.

@RobertMcCarthy97

Agree with the above; found the old PPOTrainer much more flexible and usable!
