
Question about tuning set #74

Open

yakazimir opened this issue Nov 13, 2024 · 1 comment

yakazimir commented Nov 13, 2024

Did you use a special validation set for UltraFeedback when tuning the hyper-parameters in Table 7, or just the test_pref set from the original binarized UltraFeedback data? I notice that in your on-policy datasets you only include train and test splits (and that the test splits are exactly the test_pref instances).


yumeng5 (Collaborator) commented Nov 17, 2024

Hi @yakazimir

Yes, we did include a test split for the UltraFeedback preference data. In our early experiments, we found that the win rates on the prompts from the test split are strongly correlated with the win rates on chat benchmarks (e.g., AlpacaEval 2 and Arena-Hard). So in principle, the test split can be used for hyperparameter tuning.

However, this test split has more instances than AlpacaEval 2 (~800 prompts) and Arena-Hard (~500 prompts) and incurs a higher cost (e.g., LLM-as-judge API calls) for hyperparameter tuning. Therefore, in practice, we did hyperparameter tuning based on the benchmark scores across all the evaluation sets. Given the high correlation between win rates on these benchmarks (e.g., AlpacaEval 2 & Arena-Hard) and human evaluations, these scores provide a good proxy for human judgments when tuning hyperparameters.

Best,
Yu
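
Below is a minimal sketch of how the test split could serve as a held-out validation set along the lines described above. It assumes the `HuggingFaceH4/ultrafeedback_binarized` dataset on the Hugging Face Hub (with a `test_prefs` split and a `prompt` field); the `judge` and `generate_responses` callables are hypothetical placeholders (e.g., a wrapper around an LLM-as-judge API), not functions from this repository.

```python
from datasets import load_dataset

# Held-out preference prompts; the split name "test_prefs" follows the
# HuggingFaceH4/ultrafeedback_binarized dataset card (an assumption here).
test_set = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="test_prefs")
prompts = [example["prompt"] for example in test_set]


def win_rate(candidate_outputs, baseline_outputs, judge):
    """Fraction of prompts where the judge prefers the candidate response.

    `judge` is a hypothetical callable (e.g., wrapping an LLM-as-judge API)
    that returns True when the candidate beats the baseline on a prompt.
    """
    wins = sum(
        judge(prompt, cand, base)
        for prompt, cand, base in zip(prompts, candidate_outputs, baseline_outputs)
    )
    return wins / len(prompts)


# Hyperparameter selection: generate responses under each candidate setting,
# then keep the configuration with the highest win rate on the held-out prompts.
# `configs`, `generate_responses`, `baseline_outputs`, and `judge` are placeholders.
# best_config = max(
#     configs,
#     key=lambda c: win_rate(generate_responses(c, prompts), baseline_outputs, judge),
# )
```

With a helper like this, each hyperparameter configuration is scored by its win rate against a fixed baseline on the held-out prompts, mirroring the correlation-based argument above for using the test split (or the chat benchmarks themselves) as a tuning signal.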
