Feature request
Allow more flexibility and hackability of the `PPOTrainer` (and the `RLOOTrainer`, for that matter) by exposing more parameters and refactoring the train loop to be more modular, with overridable functions.
Motivation
I feel like the new `PPOTrainer` is far less flexible than the old one: as soon as you want to deviate from the common case the trainer was built for, you are basically forced to copy-paste it and modify it in multiple places.
I'll use the tools examples (`examples/research_projects/tools`), which build on `TextEnvironment` and are now completely broken, to illustrate this:
- often you don't need a reward model but just a simple reward function, like in the tools examples
- `train_dataset` is mandatory
  - usually this makes sense, but not always! Data can also be (and sometimes must be) generated on the fly, like in the calculator example
- the `GenerationConfig` is hardcoded and no `generation_kwargs` are exposed, only `response_length` and `temperature`
  - users should have control over generation; e.g. in the tool example it seems wasteful to always generate a fixed number of tokens, instead we may want to stop after the next tool call, for which `stop_strings` would be helpful (eos alone is not enough)
- no step-wise training
  - the train loop is one giant function; the old trainer had a `ppo_trainer.step(queries, responses, rewards, masks)` function that played nicely with the tool example and in general allowed a flexible, custom training loop (see the sketch right after this list)
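To make the last point concrete, here is a rough sketch of the kind of custom loop the old step-based API allowed, and what I would like to be able to write again: queries built on the fly, a plain Python reward function instead of a reward model, generation that stops at the next tool call via `stop_strings`, and a single `step()` call per PPO update. The signatures follow the old (pre-0.12) trainer, `make_query` / `my_reward_fn` are just placeholders for task-specific logic, and `stop_strings` needs a recent transformers version, so treat this as a sketch rather than code that runs against the current trainer:

```python
import random

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_id)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_id)

config = PPOConfig(batch_size=8, mini_batch_size=4)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)


def make_query() -> str:
    # placeholder: build a prompt on the fly (the calculator example would
    # generate a fresh arithmetic question here)
    a, b = random.randint(0, 99), random.randint(0, 99)
    return f"Q: What is {a} + {b}? A:"


def my_reward_fn(response_text: str) -> float:
    # placeholder: a plain reward function, no reward model involved
    return 1.0 if any(ch.isdigit() for ch in response_text) else -1.0


generation_kwargs = {
    "max_new_tokens": 64,
    "do_sample": True,
    "temperature": 0.7,
    "pad_token_id": tokenizer.eos_token_id,
    # stop at the next tool call instead of always generating a fixed number
    # of tokens (stop_strings needs a recent transformers version and the
    # tokenizer to be passed along)
    "stop_strings": ["<call>"],
    "tokenizer": tokenizer,
}

for _ in range(100):
    # data generated on the fly, no static train_dataset required
    queries = [
        tokenizer(make_query(), return_tensors="pt").input_ids.squeeze(0)
        for _ in range(config.batch_size)
    ]
    responses = ppo_trainer.generate(queries, return_prompt=False, **generation_kwargs)
    rewards = [
        torch.tensor(my_reward_fn(tokenizer.decode(r, skip_special_tokens=True)))
        for r in responses
    ]
    # the step-wise API: one PPO update per call, fully under my control
    stats = ppo_trainer.step(queries, responses, rewards)
```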
I hope we can get some refactoring for more flexibility. The reward-function topic already seems to be getting addressed, and exposing `generation_kwargs` should be trivial.
I don't know how complex it would be to add a training step function for easy custom loops, or to make the default behavior overridable. But I feel like changes in that direction are required if TRL wants to be the default reinforcement learning library: it should be hackable and work seamlessly for the more advanced use cases :)
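To make the "overridable functions" part concrete, something along the following lines is the structure I have in mind. This is purely hypothetical, none of these method names exist in the current trainer; it only illustrates how the loop could be split so that a subclass (or a custom loop calling `training_step` directly) can swap out individual pieces:

```python
class RefactoredPPOTrainer:
    """Purely hypothetical sketch of a more modular trainer; nothing here
    exists in the current PPOTrainer."""

    def train(self):
        # the current monolithic train loop, split into small overridable pieces
        for queries in self.get_query_batches():
            responses = self.generate_rollouts(queries)
            rewards = self.compute_rewards(queries, responses)
            self.training_step(queries, responses, rewards)

    def get_query_batches(self):
        # default: iterate over train_dataset; override to generate data on the fly
        raise NotImplementedError

    def generate_rollouts(self, queries):
        # default: model.generate with user-supplied generation_kwargs;
        # override to roll out through a TextEnvironment with tool calls
        raise NotImplementedError

    def compute_rewards(self, queries, responses):
        # default: score with a reward model; override to use a plain reward function
        raise NotImplementedError

    def training_step(self, queries, responses, rewards, masks=None):
        # the bare PPO update, also callable directly from a custom loop
        raise NotImplementedError
```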
Your contribution
none right now, sorry