Allow more flexibility in new PPOTrainer (aka PPOv2) #2576

Open
Benjoyo opened this issue Jan 16, 2025 · 1 comment
Benjoyo commented Jan 16, 2025

Feature request

Allow more flexibility and hackability in the PPOTrainer (and the RLOOTrainer, for that matter) by exposing more parameters and refactoring the train loop into smaller, overridable functions.

Motivation

I feel like the new PPOTrainer is much less flexible than the old one; as soon as you want to deviate from the common case it was built for, you are basically forced to copy-paste it and modify it in multiple places.

I'll use the tools examples (examples/research_projects/tools), which rely on TextEnvironment and are now completely broken, to illustrate this:

  • reward_model as nn.Module is mandatory (see [question] best way to have my own reward model which is backed by rules #2518)
    • often you don't need a reward model at all but just a simple function, like in the tools examples (see the reward-function sketch below)
  • train_dataset is mandatory
    • usually this will make sense, but not always! Data can also be (and sometimes must be) generated on the fly, like in the calculator example
  • GenerationConfig is hardcoded and no generation_kwargs are exposed, only response_length and temperature
    • users should have control over generation; e.g. in the tools example it is wasteful to generate a fixed number of tokens, instead we may want to stop right after the next tool call, for which stop_strings would help (an EOS token alone is not enough); see the generation sketch further below
  • no step-wise training
    • the train loop is one giant function; the old trainer had a ppo_trainer.step(queries, responses, rewards, masks) function that played nicely with the tools example and, more generally, allowed a flexible, custom training loop (see the loop sketch further below)
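
To make the reward point concrete, a rules-based reward for the tools use case is really just a plain function over decoded completions. Here is a minimal sketch; the `(prompts, completions) -> rewards` signature is only my guess at what a function-based interface could look like, not an existing TRL API:

```python
import torch

# Hypothetical rule-based reward: score decoded completions without any nn.Module.
# The (prompts, completions) -> tensor signature is an assumption, not an existing TRL API.
def rule_based_reward(prompts: list[str], completions: list[str]) -> torch.Tensor:
    rewards = []
    for completion in completions:
        # Toy rule for a calculator-style tool: reward a well-formed tool call, penalize the rest.
        rewards.append(1.0 if "<request>" in completion and "<call>" in completion else -1.0)
    return torch.tensor(rewards)
```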

I hope we can get some refactoring for more flexibility. The reward function topic already seems to be getting addressed, and exposing generation_kwargs should be trivial.
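
On the generation side, the kind of control I mean looks roughly like this. stop_strings is a regular transformers GenerationConfig/generate argument; whether the trainer would accept a full GenerationConfig or raw generation_kwargs is of course up for discussion:

```python
from transformers import GenerationConfig

# Sketch of the generation control users need; passing this into the new PPOTrainer
# (e.g. via a generation_config / generation_kwargs argument) is not possible today.
generation_config = GenerationConfig(
    max_new_tokens=256,       # an upper bound instead of always generating response_length tokens
    do_sample=True,
    temperature=0.7,
    stop_strings=["<call>"],  # stop right after the next tool call instead of running to the limit
)
# Note: transformers requires passing the tokenizer to generate() when stop_strings is set.
```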

I don't know how complex it would be to add a training step function for easy custom loops, or to make the default behavior overridable. But I feel changes in that direction are needed if TRL wants to be the default reinforcement learning library: it should be hackable and work seamlessly for more advanced use cases :)
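
For reference, the kind of custom loop the old API made possible looked roughly like this; the generate/step/log_stats calls follow the old PPOTrainer, nothing equivalent exists in the new one, and score() below is just a placeholder for custom reward logic:

```python
import torch

# Rough shape of a custom loop with the old step-based API (not available in the new PPOTrainer).
# Assumes ppo_trainer, tokenizer, dataloader and generation_kwargs are set up as in the old TRL examples.
for batch in dataloader:
    query_tensors = batch["input_ids"]

    # Generate rollouts however we like, e.g. via a TextEnvironment with tool calls.
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)

    # Compute rewards with arbitrary custom logic; no nn.Module required.
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
    rewards = [torch.tensor(score(text)) for text in texts]  # score() is a placeholder

    # One PPO optimization step on this batch, then log.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
```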

Your contribution

None right now, sorry.

@RobertMcCarthy97

Agree with the above; found the old PPOTrainer much more flexible and usable!
