Improve inspect score command #1293
Conversation
Force-pushed from ed315e8 to 42ffea9
We currently record these in results with the expanded scores, but we need to record them for the case when --no-score is specified (so we record what scorers / args were run with the original task).
We were previously loading metrics from the scorers, but the list of metrics actually overrides the scorer metrics, so this resulted in scoring not producing metrics. This change reads any metrics directly from eval.metrics and uses those if present; otherwise, the metrics associated with the scorers will be used.
Force-pushed from 42ffea9 to a7e40e4
Super nice! Just a minor typo, which I have flagged.
Edit: please now ignore this comment, I was getting confused with metrics and scorers! Noting that I (and I know others) rarely use `inspect score`, as we often combine many eval logs, filter for quality and then score, e.g. with pass@k and variance metrics.
src/inspect_ai/_cli/score.py
Outdated
```python
choices=["overwrite", "create", "o", "c"],
default="create",
)
if file_action in ["overwrite", "0"]:
```
This line should be:
`if file_action in ["overwrite", "o"]:`
Current behaviour is that if the user provides 'o' they go down the 'create' flow.
Thank you! Good catch!
This PR contains:
Unscored Evaluations
By default, model output in evaluations is automatically scored. However, you can separate generating completions and scoring by using the `--no-score` option. For example:

```bash
inspect eval popularity.py --model openai/gpt-4 --no-score
```
This will produce a log with sample trajectories that have not yet been scored, and no evaluation metrics.
Tip
Using a distinct scoring step is particularly useful during scorer development, as it bypasses the entire generation phase, saving lots of time and inference costs.
Scoring Logs
You can score an evaluation previously run this way using the `inspect score` command:

```bash
# score an unscored eval
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval
```
This will use the scorers and metrics that were declared when the evaluation was run, applying them to score each sample and generate metrics for the evaluation.
You may choose to use a different scorer than the task scorer to score a log file. In this case, you can use the `--scorer` option to pass the name of a scorer or the path to a source code file containing a scorer to use. If you need to pass arguments to the scorer, you can do so using scorer args (both are sketched in the example below).
Overwriting Logs
When you use the `inspect score` command, you will be prompted whether you'd like to overwrite the existing log file (adding the scores) or create a new scored log file. By default, the command will create a new log file with a `-scored` suffix to disambiguate it from the original file. You may also control this using the `--overwrite` flag like so:

```bash
# overwrite the log with scores from the task-defined scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --overwrite
```
Overwriting Scores
When rescoring a previously scored log file (for instance, when using a new scoring system), you have two options: append the new scores alongside the existing ones, or overwrite the existing scores entirely. You can choose which mode to use based on whether you want to preserve or discard the previous scoring data. To control this, use the `--action` arg:
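For example, something along these lines (the `append` and `overwrite` values for `--action` are assumptions inferred from the description above):

```bash
# keep the existing scores and add the new ones alongside them
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --action append

# discard the existing scores and replace them with the new ones
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --action overwrite
```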
Using Python

You can also use the `score()` function in your Python code to score evaluation logs. For example, if you are exploring the performance of different scorers, you might find it more useful to call the `score()` function using varying scorers or scorer options. You can also use this function to score an existing log file (appending or overwriting results). Both uses are sketched below.
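As a rough sketch (the imports, the `score()` signature, and the `model_graded_fact` scorer used here are assumptions about the inspect_ai API rather than code from this PR):

```python
# Conceptual sketch: names and signatures are assumed, not taken from this PR.
from inspect_ai import score
from inspect_ai.log import read_eval_log, write_eval_log
from inspect_ai.scorer import model_graded_fact

# read a previously generated (possibly unscored) eval log
log = read_eval_log("./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval")

# score it with a scorer of your choosing (useful while iterating on scorers)
scored = score(log, model_graded_fact())

# write the scored results, either to a new file or back over the original
write_eval_log(scored, "./logs/2024-02-23_task_gpt-4_TUhnCn473c6-scored.eval")
```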