No, I only need to train the ORM. However, the DPO implementation requires equal numbers of positive and negative samples, whereas ORM training for math can tolerate an imbalance between them. It would help a lot if you could provide a script for ORM training!
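To make the point about imbalance concrete, here is a minimal sketch, assuming the ORM is trained as a per-solution binary classifier with a cross-entropy loss (one common formulation, not necessarily what VeRL or OpenRLHF implement). The function name `orm_bce_loss` and the sample batch are hypothetical. Each solution is scored on its own, so nothing forces positives and negatives to come in pairs.

```python
import math


def orm_bce_loss(scores, labels):
    """Mean binary cross-entropy over independently labeled solutions.

    scores: raw (pre-sigmoid) reward-model outputs, one per solution.
    labels: 1 for a correct solution, 0 for an incorrect one.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    total = 0.0
    for s, y in zip(scores, labels):
        # Numerically stable form of
        # -[y * log(sigmoid(s)) + (1 - y) * log(1 - sigmoid(s))].
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(scores)


# Hypothetical batch: one correct solution and three incorrect ones.
scores = [2.0, -1.0, 0.5, -3.0]
labels = [1, 0, 0, 0]
loss = orm_bce_loss(scores, labels)
```

A pairwise objective such as DPO instead needs a (chosen, rejected) pair for every training example, which is where the equal-count requirement comes from.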
Hi Authors,
Your repo is really good, and I wonder whether we could use VeRL to train reward models, just as OpenRLHF does.