Can we use VeRL to train the reward models #197

YSLIU627 · 2025-02-04T17:07:53Z

Hi Authors,
Your repo is really good. I wonder whether we could use VeRL to train reward models, as OpenRLHF does.

vermouth1992 · 2025-02-05T01:42:27Z

Actually, we already have a DPO implementation on a branch. Training an ORM should be pretty similar to training with DPO. Do you also need PRM training?

YSLIU627 · 2025-02-05T17:40:36Z

No, I only need to train the ORM. However, the DPO implementation requires equal numbers of positive and negative samples, while ORM training on math data can tolerate an imbalance between them. It would help a lot if you could provide a script for ORM training!
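
To illustrate the imbalance point, here is a minimal sketch in plain Python. It is not VeRL or OpenRLHF code: the function names `pairwise_loss` and `pointwise_orm_loss` and the example scores are hypothetical, and the pairwise function only mirrors the paired data layout that DPO-style training expects, not the actual DPO objective on the branch. A pairwise loss needs one negative for every positive, while a pointwise binary cross-entropy loss scores each response on its own, so the label counts can differ.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_loss(chosen_scores, rejected_scores):
    """Pairwise preference loss: every chosen score needs a matching rejected score."""
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("pairwise loss needs equal numbers of chosen and rejected samples")
    losses = [-math.log(sigmoid(c - r)) for c, r in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)


def pointwise_orm_loss(scores, labels):
    """Binary cross-entropy on individually labeled responses (1 = correct, 0 = incorrect)."""
    if len(scores) != len(labels):
        raise ValueError("each score needs exactly one label")
    losses = []
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


# Hypothetical batch: one correct answer and three incorrect ones.
scores = [2.0, -1.0, -0.5, -3.0]
labels = [1, 0, 0, 0]
orm_loss = pointwise_orm_loss(scores, labels)
```

With the pointwise form, a math dataset whose correctness labels come from an answer checker can be used directly, even when most sampled responses are wrong. If the imbalance is severe, per-class weighting is one common option.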

vermouth1992 added the enhancement New feature or request label Feb 5, 2025

PeterSH6 assigned vermouth1992 Feb 9, 2025
