Recent experiments have shown that in the training data used by the GRPOTrainer, only the query from the SFT data is passed in, while the response (solution) is discarded. Under these circumstances, can RL training still outperform the SFT results? Additionally, why not add the response from the SFT data to the list of generated responses during training?
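For illustration, here is a minimal sketch of the data handling being described: starting from hypothetical SFT examples (prompt/response pairs, made up for this example), the response column is dropped so that only prompts are fed to GRPO-style training, where the model's own sampled completions are scored by a reward function instead.

```python
# Hypothetical SFT examples: each pair has a query and a gold response.
sft_data = [
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"prompt": "Name the capital of France.", "response": "Paris"},
]

# GRPO-style training keeps only the prompt; the gold response is discarded,
# and rewards are computed on the model's sampled completions instead.
grpo_data = [{"prompt": ex["prompt"]} for ex in sft_data]

print(grpo_data)
# → [{'prompt': 'What is 2 + 2?'}, {'prompt': 'Name the capital of France.'}]
```

The question above is whether the discarded `response` could instead be appended to the group of sampled completions, so the known-good answer also contributes to the group-relative advantage estimate.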