Improve inspect score command #1293

Merged · 26 commits merged into main on Feb 12, 2025
Conversation

@dragonstyle (Collaborator) commented Feb 11, 2025

This PR contains:

  • New features
  • Changes to dev-tools, e.g. CI config / GitHub tooling
  • Docs
  • Bug fixes
  • Code refactor

Unscored Evaluations

By default, model output in evaluations is automatically scored. However, you can separate generating completions and scoring by using the --no-score option. For example:

inspect eval popularity.py --model openai/gpt-4 --no-score

This will produce a log whose samples contain trajectories that have not yet been scored, and which has no evaluation metrics.
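If you want to confirm that a log produced this way is unscored, you can read it back from Python. The following is a minimal sketch, assuming a hypothetical log path (substitute your own file); it simply checks that no results were recorded:

from inspect_ai.log import read_eval_log

# the path below is illustrative -- substitute your own log file
log = read_eval_log("./logs/2024-02-23_popularity_gpt-4.eval")
print(log.status)   # e.g. "success"
print(log.results)  # expected to be None for an unscored log (no scores or metrics)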

Tip

Using a distinct scoring step is particularly useful during scorer development, as it lets you iterate on a scorer without re-running the generation phase, saving substantial time and inference cost.

Scoring Logs

You can score an evaluation previously run this way using the inspect score command:

# score an unscored eval
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval

This will use the scorers and metrics that were declared when the evaluation was run, applying them to score each sample and generate metrics for the evaluation.

You may choose to use a different scorer than the task scorer to score a log file. In this case, use the --scorer option to pass the name of a scorer or the path to a source code file containing a scorer. For example:

# use the built-in match scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer match

# use my custom scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py

# use a custom scorer named 'classify' in a file with more than one scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorers.py@classify
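For reference, a scorer source file such as custom_scorer.py might look something like the sketch below. This is illustrative only (not a file from this repository); the scorer name and the naive substring check stand in for whatever logic your scorer actually needs. The function name is what you reference after the @ when a file contains more than one scorer:

# custom_scorer.py (illustrative sketch)
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def custom_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        # naive check: does the model completion contain the target text?
        answer = state.output.completion
        is_correct = target.text.strip().lower() in answer.lower()
        return Score(
            value=CORRECT if is_correct else INCORRECT,
            answer=answer,
        )

    return score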

If you need to pass arguments to the scorer, you can do so using scorer args like so:

# pass an argument to the built-in match scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer match -S location="end"

Overwriting Logs

When you use the inspect score command, you will be prompted to either overwrite the existing log file (adding the scores to it) or create a new scored log file. By default, the command will create a new log file with a -scored suffix to disambiguate it from the original file. You can also control this using the --overwrite flag like so:

# overwrite the log with scores from the task defined scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --overwrite

Overwriting Scores

When rescoring a previously scored log file (for instance, when using a new scoring system), you have two options:

  1. Append Mode (Default): The new scores will be added alongside the existing scores in the log file, keeping both the old and new results.
  2. Overwrite Mode: The new scores will replace the existing scores in the log file, removing the old results.

You can choose which mode to use based on whether you want to preserve or discard the previous scoring data. To control this, use the --action arg:

# append scores from custom scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py --action append

# overwrite scores with new scores from custom scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py --action overwrite

Using Python

You can also use the score() function in your Python code to score evaluation logs.

For example, if you are exploring the performance of different scorers, you might find it more convenient to call score() with varying scorers or scorer options:

from inspect_ai import eval, score
from inspect_ai.scorer import model_graded_qa

# run the evaluation (the `popularity` task and `plot_results()`
# helper are assumed to be defined elsewhere)
log = eval(popularity, model="openai/gpt-4")[0]

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

# score the same log once per grader model
scoring_logs = [score(log, model_graded_qa(model=model))
                for model in grader_models]

plot_results(scoring_logs)

You can also use this function to score an existing log file (appending or overwriting results) like so:

import os

from inspect_ai import score
from inspect_ai.log import read_eval_log, write_eval_log
from inspect_ai.scorer import model_graded_qa

# read the log
input_log_path = "./logs/2025-02-11T15-17-00-05-00_popularity_dPiJifoWeEQBrfWsAopzWr.eval"
log = read_eval_log(input_log_path)

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

# perform the scoring using various models, appending to any existing scores
scoring_logs = [score(log, model_graded_qa(model=model), action="append")
                for model in grader_models]

# write log files with the model name as a suffix
for model, scored_log in zip(grader_models, scoring_logs):
    base, ext = os.path.splitext(input_log_path)
    output_file = f"{base}_{model.replace('/', '_')}{ext}"
    write_eval_log(scored_log, output_file)
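If you then want to compare the grader models, you can read the metrics back out of the scored logs. This is a sketch that assumes the log's results expose a list of scores with per-metric values; exact field names may differ between inspect_ai versions:

# summarize metrics from each scored log (schema assumption noted above)
for model, scored_log in zip(grader_models, scoring_logs):
    if scored_log.results is not None:
        for eval_score in scored_log.results.scores:
            for name, metric in eval_score.metrics.items():
                print(f"{model} | {eval_score.name}/{name} = {metric.value}")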

@dragonstyle marked this pull request as ready for review February 12, 2025 16:17
We currently record these in results with the expanded scores, but we need to record them for the case when --no-score is specified (so we record what scorers / args were run with the original task)

more storing

cool

lint
We were previously loading metrics from the scorers, but the list of metrics actually overrides the scorer metrics, so this resulted in scoring not producing metrics.

This reads any metrics directly from the eval.metrics and uses those if present. Otherwise metrics associated with the scorers will be used.
@hcoppockno10 left a comment

Super nice! Just a minor typo, which I have flagged.

Edit: please now ignore this comment; I was getting confused between metrics and scorers! Noting that I (and I know others) rarely use inspect score, as we often combine many eval logs, filter for quality, and then score, e.g. with pass@k and variance metrics.

choices=["overwrite", "create", "o", "c"],
default="create",
)
if file_action in ["overwrite", "0"]:

This line should be:
if file_action in ["overwrite", "o"]:

Current behaviour is that if the user provides 'o', they go down the 'create' flow.

@dragonstyle (Collaborator, Author)

Thank you! Good catch!

@jjallaire merged commit 25e3bdc into main Feb 12, 2025
9 checks passed
@jjallaire deleted the feature/score-task branch February 12, 2025 21:34