[arXiv]
Run notebook_gpt4/gpt-4-1106-preview_vs_nil.ipynb to get the null response augmented with the adversarial string.
Run 01_prepare_submission.ipynb to craft the null model submission.
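As a rough illustration of what a null model submission might look like, the sketch below builds an outputs file in which every instruction receives the same response. It is not the notebook's code. The function name `build_null_submission`, the placeholder strings, and the `instruction` / `output` / `generator` keys are assumptions about the format alpaca-eval expects; check them against the notebook and the alpaca-eval documentation.

```python
import json


def build_null_submission(instructions, null_response, generator="null_model"):
    """Return one record per instruction, all sharing the same constant output.

    Hypothetical helper: the key names below are assumed, not taken from
    01_prepare_submission.ipynb.
    """
    return [
        {"instruction": ins, "output": null_response, "generator": generator}
        for ins in instructions
    ]


# Placeholder inputs for illustration only.
example_instructions = ["Placeholder instruction 1", "Placeholder instruction 2"]
example_response = "<null response with adversarial string>"

submission = build_null_submission(example_instructions, example_response)

# Serialize to JSON text; writing it to a file is left to the caller.
submission_json = json.dumps(submission, indent=2)
```

The resulting JSON text can be saved to whatever path is later passed to `--model_outputs`.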
To install the stable release of AlpacaEval 2.0, run:
pip install alpaca-eval
Then you can use it to evaluate the submission as follows:
export OPENAI_API_KEY=<your_api_key> # for more complex configs (e.g. using Azure or switching clients), see client_configs/README.md
alpaca_eval --model_outputs 'example/outputs.json'
Run 02_re_evaluate_submission.ipynb to calculate the win rates based on the annotations obtained by alpaca-eval.
For example, the alpaca-eval annotations of our null model yield the following win rates:
{'win_rate': 76.91979180386511,
 'standard_error': 0.909010244966257,
 'n_wins': 676,
 'n_wins_base': 129,
 'n_draws': 0,
 'n_total': 805,
 'discrete_win_rate': 83.97515527950311,
 'length_controlled_winrate': 86.45780691307944,
 'lc_standard_error': 0.1418000511342794}
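The counts in this output can be sanity-checked by hand. The sketch below assumes that `discrete_win_rate` is the percentage of comparisons won, with draws counted as half a win; that definition is an assumption here, not something stated above, but it reproduces the reported value. The helper name `discrete_win_rate_from_counts` is hypothetical.

```python
def discrete_win_rate_from_counts(n_wins, n_draws, n_total):
    """Percentage of wins, counting each draw as half a win (assumed definition)."""
    return 100.0 * (n_wins + 0.5 * n_draws) / n_total


# Values copied from the reported results.
n_wins, n_wins_base, n_draws, n_total = 676, 129, 0, 805

# Wins, baseline wins, and draws should account for every comparison.
counts_are_consistent = (n_wins + n_wins_base + n_draws) == n_total

recomputed = discrete_win_rate_from_counts(n_wins, n_draws, n_total)
print(counts_are_consistent, round(recomputed, 4))
```

The continuous `win_rate` and `length_controlled_winrate` cannot be recovered from the counts alone, since they depend on the per-example annotations.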