Did you use a special validation set for UltraFeedback when tuning the hyper-parameters in Table 7, or just the test_pref set from the original binarized UltraFeedback data? I notice that in your on-policy datasets you only include train and test splits (and that the test splits are exactly the test_pref instances).
Yes, we did include a test split for the UltraFeedback preference data. In our early experiments, we found that the win rates on the prompts from the test split are strongly correlated with the win rates on chat benchmarks (e.g., AlpacaEval 2 and Arena-Hard). So in principle, the test split can be used for hyperparameter tuning.
However, this test split has more instances than AlpacaEval 2 (~800 prompts) and Arena-Hard (~500 prompts), so using it incurs a higher cost (e.g., LLM-as-judge API calls) for hyperparameter tuning. Therefore, in practice, we tuned hyperparameters based on the benchmark scores across all the evaluation sets. Given the high correlation between the win rates on these benchmarks (e.g., AlpacaEval 2 & Arena-Hard) and human evaluations, these scores serve as a good proxy for human judgments when tuning hyperparameters.
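For concreteness, here is a minimal sketch of this kind of selection procedure: compute an LLM-as-judge win rate per evaluation set for each candidate hyperparameter configuration and keep the best-scoring one. The `judge_win` helper and the file layout are hypothetical placeholders, not the authors' actual tooling.

```python
import json
from statistics import mean

def judge_win(prompt: str, candidate: str, baseline: str) -> bool:
    """Hypothetical LLM-as-judge call: returns True if the candidate
    response is preferred over the baseline response for this prompt."""
    raise NotImplementedError("plug in an LLM-as-judge API here")

def win_rate(eval_set: list[dict]) -> float:
    """Fraction of prompts on which the candidate beats the baseline."""
    return mean(
        1.0 if judge_win(ex["prompt"], ex["candidate"], ex["baseline"]) else 0.0
        for ex in eval_set
    )

def select_config(configs: dict[str, list[str]]) -> str:
    """Pick the config whose outputs achieve the best average win rate
    across the evaluation sets (e.g., AlpacaEval 2 and Arena-Hard).
    `configs` maps a config name to paths of its per-benchmark JSON files."""
    best_name, best_score = None, -1.0
    for name, eval_files in configs.items():
        scores = []
        for path in eval_files:
            with open(path) as f:
                scores.append(win_rate(json.load(f)))
        if mean(scores) > best_score:
            best_name, best_score = name, mean(scores)
    return best_name
```

The same loop would work with the UltraFeedback test split substituted for the benchmark files; the trade-off noted above is simply that the larger split means more judge calls per configuration.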