Recent experiments have shown that in the training data used by the GRPOTrainer, only the query from the SFT data is passed in, while the response (solution) is discarded. Under these circumstances, can RL training still outperform the SFT results? Additionally, why not add the response from the SFT data to the list of generated responses during training?
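For illustration, here is a minimal sketch of the data handling being described: starting from hypothetical SFT examples (prompt/response pairs, made up for this example), the response column is dropped so that only prompts are fed to GRPO-style training, where the model's own sampled completions are scored by a reward function instead.

```python
# Hypothetical SFT examples: each pair has a query and a gold response.
sft_data = [
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"prompt": "Name the capital of France.", "response": "Paris"},
]

# GRPO-style training keeps only the prompt; the gold response is discarded,
# and rewards are computed on the model's sampled completions instead.
grpo_data = [{"prompt": ex["prompt"]} for ex in sft_data]

print(grpo_data)
# → [{'prompt': 'What is 2 + 2?'}, {'prompt': 'Name the capital of France.'}]
```

The question above is whether the discarded `response` could instead be appended to the group of sampled completions, so the known-good answer also contributes to the group-relative advantage estimate.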