Question about kl_penalty #211

StarDewXXX · 2025-02-06T02:19:10Z

The KL penalty is folded into batch['token_level_reward'] in apply_kl_penalty() (trainer/ppo/ray_trainer.py). But in update_policy() (workers/actor/dp_actor.py), kl_loss is added to the final loss as well. Does that mean the KL penalty is applied twice?

PeterSH6 · 2025-02-06T02:40:36Z

Hi @StarDewXXX, if kl_loss is used, the kl_penalty is not applied to the reward, so the KL term is only counted once. You can see the code here: https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py#L717-L723
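
To make the either/or concrete, here is a minimal sketch of that branching in plain Python. It is not the verl code: the function names, the `use_kl_loss` flag, and the `beta` and `kl_loss_coef` coefficients are hypothetical names chosen for illustration, and the per-token KL values are assumed to be precomputed.

```python
from typing import List


def shape_rewards(scores: List[float], kl: List[float],
                  use_kl_loss: bool, beta: float) -> List[float]:
    """Build token-level rewards from raw scores.

    Hypothetical helper: the in-reward KL penalty is subtracted only
    when the KL term is NOT already handled as a loss.
    """
    if use_kl_loss:
        # KL is handled in the actor loss, so rewards are the raw scores.
        return list(scores)
    # Otherwise penalize each token's reward by beta * KL.
    return [s - beta * k for s, k in zip(scores, kl)]


def actor_loss(pg_loss: float, kl: List[float],
               use_kl_loss: bool, kl_loss_coef: float) -> float:
    """Combine the policy-gradient loss with an optional KL loss term.

    Hypothetical helper: the KL loss is added only when use_kl_loss is set.
    """
    if not use_kl_loss:
        return pg_loss
    mean_kl = sum(kl) / len(kl)
    return pg_loss + kl_loss_coef * mean_kl


if __name__ == "__main__":
    scores = [0.0, 0.0, 1.0]   # raw per-token scores (assumed values)
    kl = [0.1, 0.2, 0.3]       # per-token KL estimates (assumed values)
    for flag in (True, False):
        rewards = shape_rewards(scores, kl, use_kl_loss=flag, beta=0.5)
        loss = actor_loss(1.0, kl, use_kl_loss=flag, kl_loss_coef=0.1)
        print(flag, rewards, loss)
```

Under these assumptions, exactly one of the two paths touches the KL term for any given flag value: with the flag on, the rewards stay equal to the raw scores and the loss grows by the KL term; with it off, the rewards are penalized and the loss is left alone.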

PeterSH6 added the "question" label (Further information is requested) Feb 9, 2025

PeterSH6 self-assigned this Feb 9, 2025
