Why are Chat's Critic model parameters updated with Actor's parameters? #3400
Replies: 1 comment
-
For your information, I'm familiar with TRL's PPO implementation. The original PPO algorithm has only the Critic model (a.k.a. the reward model). I can understand separating the Critic model from the Reward model to better control the outputs; what I don't understand is why they are identical at the beginning of training, after which the Critic model is updated while the Reward model remains unchanged. Especially since the final exported model is still the language model. Why are these two models merged into one, as in TRL's PPOTrainer?
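To make the pattern under discussion concrete, here is a minimal sketch assuming a PyTorch-style setup (the tiny `reward_model` and all variable names are placeholders for illustration, not the actual trainer's identifiers): the critic starts as an exact copy of the reward model, only the critic gets an optimizer, and the reward model is frozen.

```python
import copy
import torch

# Placeholder scalar-head network standing in for a trained reward model.
reward_model = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)

# The critic begins as an exact copy, so at step 0 its value estimates
# coincide with the reward model's scores.
critic_model = copy.deepcopy(reward_model)

# The reward model is frozen: it defines a fixed training objective.
for p in reward_model.parameters():
    p.requires_grad_(False)

# Only the critic (and the actor, not shown) is optimized, so its
# parameters drift away from the reward model's as PPO training proceeds.
critic_optimizer = torch.optim.Adam(critic_model.parameters(), lr=1e-5)
```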
-
Hi all developers,
In the current PPO trainer implementation, the Critic model is almost the same as the reward model. During training, the critic model assigns values to the response sequences, while the reward model assigns reward scores.
However, the critic and reward models are almost identical at the beginning of training, so the difference between value and reward starts out the same everywhere. After several training updates, the parameters of the critic model have changed while the reward model's parameters are kept frozen.
How does that mechanism work?
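For reference, here is a minimal sketch of the mechanism in question, under the usual RLHF-PPO assumptions (`gae_advantages` and all variable names are made up for illustration, not the trainer's actual code): the frozen reward model contributes only the sequence score placed on the final token, while the trainable critic predicts per-token values; regressing those values toward the empirical returns is what pulls the critic's parameters away from the reward model's, even though the two start identical.

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response sequence."""
    advantages = torch.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(rewards.size(0))):
        next_value = values[t + 1] if t + 1 < rewards.size(0) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    returns = advantages + values  # regression targets for the critic
    return advantages, returns

# Toy rollout: 5 response tokens; the frozen reward model's sequence
# score lands on the last token (per-token KL penalties omitted).
values = torch.randn(5, requires_grad=True)  # critic's per-token values
rewards = torch.zeros(5)
rewards[-1] = 0.7                            # reward model's score

with torch.no_grad():
    advantages, returns = gae_advantages(rewards, values)

# The value loss pushes the critic toward the empirical returns; the
# reward model receives no gradient at all, so only the critic changes.
value_loss = 0.5 * (values - returns).pow(2).mean()
value_loss.backward()
```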