Improve inspect score command #1293

Merged · 26 commits merged into main on Feb 12, 2025
Conversation

@dragonstyle (Collaborator) commented Feb 11, 2025

This PR contains:

  • New features
  • Changes to dev-tools, e.g. CI config / GitHub tooling
  • Docs
  • Bug fixes
  • Code refactor

Unscored Evaluations

By default, model output in evaluations is automatically scored. However, you can separate generating completions and scoring by using the --no-score option. For example:

inspect eval popularity.py --model openai/gpt-4 --no-score

This will produce a log whose samples contain trajectories that have not yet been scored, and which has no evaluation metrics.
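If you want to confirm that a log produced this way is unscored, you can read it back from Python. The following is a minimal sketch, assuming a hypothetical log path (substitute your own file); it simply checks that no results were recorded:

from inspect_ai.log import read_eval_log

# the path below is illustrative -- substitute your own log file
log = read_eval_log("./logs/2024-02-23_popularity_gpt-4.eval")
print(log.status)   # e.g. "success"
print(log.results)  # expected to be None for an unscored log (no scores or metrics)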

Tip

Using a distinct scoring step is particularly useful during scorer development, as it lets you iterate on a scorer without re-running the generation phase, saving substantial time and inference cost.

Scoring Logs

You can score an evaluation previously run this way using the inspect score command:

# score an unscored eval
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval

This will use the scorers and metrics that were declared when the evaluation was run, applying them to score each sample and generate metrics for the evaluation.

You may choose to use a different scorer than the task scorer to score a log file. In this case, use the --scorer option to pass the name of a scorer or the path to a source code file containing a scorer. For example:

# use the built-in match scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer match

# use my custom scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py

# use a custom scorer named 'classify' in a file with more than one scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorers.py@classify
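For reference, a scorer source file such as custom_scorer.py might look something like the sketch below. This is illustrative only (not a file from this repository); the scorer name and the naive substring check stand in for whatever logic your scorer actually needs. The function name is what you reference after the @ when a file contains more than one scorer:

# custom_scorer.py (illustrative sketch)
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def custom_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        # naive check: does the model completion contain the target text?
        answer = state.output.completion
        is_correct = target.text.strip().lower() in answer.lower()
        return Score(
            value=CORRECT if is_correct else INCORRECT,
            answer=answer,
        )

    return score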

If you need to pass arguments to the scorer, you can do so using scorer args like so:

# pass an argument to the built-in match scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer match -S location="end"

Overwriting Logs

When you use the inspect score command, you will be prompted to either overwrite the existing log file (adding the scores to it) or create a new scored log file. By default, the command will create a new log file with a -scored suffix to disambiguate it from the original file. You can also control this using the --overwrite flag like so:

# overwrite the log with scores from the task defined scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --overwrite

Overwriting Scores

When rescoring a previously scored log file (for instance, when using a new scoring system), you have two options:

  1. Append Mode (Default): The new scores will be added alongside the existing scores in the log file, keeping both the old and new results.
  2. Overwrite Mode: The new scores will replace the existing scores in the log file, removing the old results.

You can choose which mode to use based on whether you want to preserve or discard the previous scoring data. To control this, use the --action arg:

# append scores from custom scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py --action append

# overwrite scores with new scores from custom scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py --action overwrite

Using Python

You can also use the score() function in your Python code to score evaluation logs.

For example, if you are exploring the performance of different scorers, you might find it more convenient to call score() with varying scorers or scorer options:

from inspect_ai import eval, score
from inspect_ai.scorer import model_graded_qa

# run the evaluation (the `popularity` task and `plot_results()`
# helper are assumed to be defined elsewhere)
log = eval(popularity, model="openai/gpt-4")[0]

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

# score the same log once per grader model
scoring_logs = [score(log, model_graded_qa(model=model))
                for model in grader_models]

plot_results(scoring_logs)

You can also use this function to score an existing log file (appending or overwriting results) like so:

import os

from inspect_ai import score
from inspect_ai.log import read_eval_log, write_eval_log
from inspect_ai.scorer import model_graded_qa

# read the log
input_log_path = "./logs/2025-02-11T15-17-00-05-00_popularity_dPiJifoWeEQBrfWsAopzWr.eval"
log = read_eval_log(input_log_path)

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

# perform the scoring using various models, appending to any existing scores
scoring_logs = [score(log, model_graded_qa(model=model), action="append")
                for model in grader_models]

# write log files with the model name as a suffix
for model, scored_log in zip(grader_models, scoring_logs):
    base, ext = os.path.splitext(input_log_path)
    output_file = f"{base}_{model.replace('/', '_')}{ext}"
    write_eval_log(scored_log, output_file)
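If you then want to compare the grader models, you can read the metrics back out of the scored logs. This is a sketch that assumes the log's results expose a list of scores with per-metric values; exact field names may differ between inspect_ai versions:

# summarize metrics from each scored log (schema assumption noted above)
for model, scored_log in zip(grader_models, scoring_logs):
    if scored_log.results is not None:
        for eval_score in scored_log.results.scores:
            for name, metric in eval_score.metrics.items():
                print(f"{model} | {eval_score.name}/{name} = {metric.value}")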

@dragonstyle marked this pull request as ready for review February 12, 2025 16:17
We currently record these in results with the expanded scores, but we need to record them for the case when --no-score is specified (so we record what scorers / args were run with the original task)

more storing

cool

lint
We were previously loading metrics from the scorers, but the list of metrics actually overrides the scorer metrics, so this resulted in scoring not producing metrics.

This reads any metrics directly from the eval.metrics and uses those if present. Otherwise metrics associated with the scorers will be used.
@hcoppockno10 left a comment

Super nice! Just a minor typo, which I have flagged.

Edit: please now ignore this comment; I was getting confused between metrics and scorers! Noting that I (and I know others) rarely use inspect score, as we often combine many eval logs, filter for quality, and then score, e.g. with pass@k and variance metrics.

choices=["overwrite", "create", "o", "c"],
default="create",
)
if file_action in ["overwrite", "0"]:

This line should be:
if file_action in ["overwrite", "o"]:

Current behaviour is that if the user provides 'o', they go down the 'create' flow.

@dragonstyle (Collaborator, Author)

Thank you! Good catch!

@jjallaire merged commit 25e3bdc into main Feb 12, 2025
9 checks passed
@jjallaire deleted the feature/score-task branch February 12, 2025 21:34