Record your evaluation results to W&B (wandb) with `wandb_writer.py`. With `wandb_writer.py`, you can:
- visualize how the evaluation metrics of your model change during training
- make a leaderboard to compare the metrics of different models
```bash
python wandb_writer.py --config <config_file> [--print-only]
```
- `config_file`: path to the configuration file (see Configuration for details)
- `--print-only`: only print the results to the command line; do not write them to wandb
We provide three example files in the `config` folder, one for each of the three cases below.
The general format is as follows:
```yaml
project: <str>   # your wandb project name
base_url: <str>  # your wandb instance url
# other specific configuration items
```
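As an illustration of what the general section corresponds to on the wandb side, here is a minimal sketch of loading such a config and opening a run. It is not the actual implementation in `wandb_writer.py`; the config path is an example, and it assumes the `pyyaml` and `wandb` packages are installed. `WANDB_BASE_URL` is the standard environment variable for pointing the wandb client at a self-hosted instance.

```python
import os
import yaml
import wandb

# Load the configuration file (the path here is just an example).
with open("config/example.yaml") as f:
    cfg = yaml.safe_load(f)

# WANDB_BASE_URL points the wandb client at the configured instance.
os.environ["WANDB_BASE_URL"] = cfg["base_url"]
run = wandb.init(project=cfg["project"])

# ... log evaluation metrics here ...
run.finish()
```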
The following configuration is used to parse evaluation results from a HELM output folder and record them to wandb.
```yaml
# general configurations
# ...
evals:  # evaluations to record
  - eval_type: helm  # only helm is supported for now
    model_name: <str>  # your model name
    source: helm  # use helm to parse from the helm output directory
    helm_output_dir: <your helm output dir path>
    helm_suite_name: <your helm suite name>
    token_per_iteration: <tokens per iteration in billions>
    benchmarks:  # benchmark metrics to be recorded; below are some examples
      - name: mmlu
        metrics:
          - EM
      - name: boolq
        metrics:
          - EM
      - name: narrative_qa
        metrics:
          - F1
      - name: hellaswag
        metrics:
          - EM
      - ...
```
We use the 16 core metrics of HELM as the default benchmarks if the `benchmarks` field is not provided. The default metrics are: `mmlu.EM`, `raft.EM`, `imdb.EM`, `truthful_qa.EM`, `summarization_cnndm.ROUGE-2`, `summarization_xsum.ROUGE-2`, `boolq.EM`, `msmarco_trec.NDCG@10`, `msmarco_regular.RR@10`, `narrative_qa.F1`, `natural_qa_closedbook.F1`, `natural_qa_openbook_longans.F1`, `civil_comments.EM`, `hellaswag.EM`, `openbookqa.EM`.
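The sketch below illustrates this fallback behaviour only; it assumes the HELM results have already been parsed into a nested `{benchmark: {metric: value}}` dict (the parsing itself is done by `wandb_writer.py`), and the helper name `select_metrics` is hypothetical.

```python
# Default core metrics listed above, used when no `benchmarks` field is given.
DEFAULT_METRICS = [
    "mmlu.EM", "raft.EM", "imdb.EM", "truthful_qa.EM",
    "summarization_cnndm.ROUGE-2", "summarization_xsum.ROUGE-2",
    "boolq.EM", "msmarco_trec.NDCG@10", "msmarco_regular.RR@10",
    "narrative_qa.F1", "natural_qa_closedbook.F1",
    "natural_qa_openbook_longans.F1", "civil_comments.EM",
    "hellaswag.EM", "openbookqa.EM",
]

def select_metrics(parsed_results, benchmarks=None):
    """Pick the configured benchmark metrics, falling back to the defaults.

    parsed_results: e.g. {"mmlu": {"EM": 0.345}, ...} parsed from HELM output.
    benchmarks: the `benchmarks` list from the config, or None if omitted.
    """
    if benchmarks:
        wanted = [f"{b['name']}.{m}" for b in benchmarks for m in b["metrics"]]
    else:
        wanted = DEFAULT_METRICS
    selected = {}
    for key in wanted:
        bench, metric = key.rsplit(".", 1)
        if bench in parsed_results and metric in parsed_results[bench]:
            selected[key] = parsed_results[bench][metric]
    return selected
```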
The metric scores can also be given directly in the configuration file, as in the following example.
```yaml
# general configurations
# ...
evals:  # evaluations to record
  - eval_type: helm
    model_name: llama-7B  # your model name
    source: file  # use file to read the results directly from this configuration
    token_num: 1000
    eval_result:  # evaluation results to be recorded
      mmlu:
        EM: 0.345
      boolq:
        EM: 0.751
      narrative_qa:
        F1: 0.524
      hellaswag:
        EM: 0.747
      ...
```
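To show what recording such an entry amounts to, the sketch below flattens `eval_result` into `benchmark.metric` keys and logs them against the token count. It is only an approximation of what `wandb_writer.py` does; the project name is an assumption and the step unit may differ in the real tool.

```python
import wandb

eval_entry = {
    "model_name": "llama-7B",
    "token_num": 1000,
    "eval_result": {
        "mmlu": {"EM": 0.345},
        "boolq": {"EM": 0.751},
        "narrative_qa": {"F1": 0.524},
        "hellaswag": {"EM": 0.747},
    },
}

run = wandb.init(project="my-evals", name=eval_entry["model_name"])
# Flatten {"mmlu": {"EM": 0.345}} into {"mmlu.EM": 0.345} and log it against
# the number of training tokens so curves line up across checkpoints.
flat = {f"{bench}.{metric}": value
        for bench, results in eval_entry["eval_result"].items()
        for metric, value in results.items()}
run.log(flat, step=eval_entry["token_num"])
run.finish()
```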
The following configuration is used to make a leaderboard.
```yaml
# general configurations
# ...
leaderboard: True
leaderboard_metrics:  # metrics required for the leaderboard
  - mmlu.EM
  - boolq.EM
  - quac.F1
  - hellaswag.EM
  - ...
excluded_models:  # models that do not participate in the leaderboard
  - <model to exclude>
  - ...
```
We use the 16 core metrics of HELM as the default leaderboard metrics if the `leaderboard_metrics` field is not provided; they are the same as the default benchmark metrics.
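For reference, a leaderboard of this kind can be represented in wandb as a `wandb.Table`. The sketch below builds one from a dict of per-model metrics; the per-model scores, the project name, and the table layout are illustrative assumptions, not the exact output of `wandb_writer.py`.

```python
import wandb

# Hypothetical per-model results keyed by the leaderboard metrics.
results = {
    "llama-7B": {"mmlu.EM": 0.345, "boolq.EM": 0.751, "hellaswag.EM": 0.747},
    "model-B":  {"mmlu.EM": 0.412, "boolq.EM": 0.780, "hellaswag.EM": 0.761},
}
leaderboard_metrics = ["mmlu.EM", "boolq.EM", "hellaswag.EM"]
excluded_models = []  # models listed under `excluded_models` in the config

run = wandb.init(project="my-evals", name="leaderboard")
table = wandb.Table(columns=["model"] + leaderboard_metrics)
for model, scores in results.items():
    if model in excluded_models:
        continue  # skip models excluded from the leaderboard
    table.add_data(model, *(scores.get(m) for m in leaderboard_metrics))
run.log({"leaderboard": table})
run.finish()
```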