
568 wire eval lite #616

Merged
merged 10 commits into main from 568-wire-eval-lite on Jan 15, 2025
Conversation

aittalam
Member

@aittalam aittalam commented Jan 14, 2025

What's changing

Wires the new eval_lite job into the experiments_new workflow. As a result, our new experiments now hit an inference job first, followed by a "light" evaluation job that calculates evaluation metrics from the provided ground_truth and predictions.

The PR includes a few fixes to eval_lite (tested locally), dependency cleanup, and a refactoring of the existing code to add a light evaluation JobType, so that we can treat it separately from the existing evaluation and keep both working for some time (until we are confident in the new eval and choose to remove the evaluator code).

Refs #568

How to test it

You can directly hit the experiments_new endpoint by passing a dataset that already has both original samples and ground truth. Here are two scripts you can run to upload a dataset (dialogsum.csv here) and run the multi-job experiment:

test_upload_dataset

#!/bin/bash
# Upload a CSV dataset (dialogsum.csv by default) to the backend.
if [ "$#" -gt 0 ]; then
    DATA_CSV_PATH="$1"
else
    DATA_CSV_PATH="$HOME/Downloads/dialogsum.csv"
fi

if [[ -z "${BACKEND_URL}" ]]; then
  BACKEND_URL=http://localhost:8000
fi

echo "Connecting to $BACKEND_URL..."

curl -s "$BACKEND_URL/api/v1/datasets/" \
  -H 'Accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'dataset=@'"$DATA_CSV_PATH"';type=text/csv' \
  -F 'format=job' | jq
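
For reference, assuming the snippet above is saved as an executable test_upload_dataset.sh (the filename is an assumption based on the heading), you can invoke it with an optional CSV path:

./test_upload_dataset.sh ~/Downloads/dialogsum.csv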

test_experiment_mistral

#!/bin/bash
# Launch a multi-job (inference + eval) experiment on the most recently uploaded dataset.
if [[ -z "${BACKEND_URL}" ]]; then
  BACKEND_URL=http://localhost:8000
fi

DATASET_ID=$(curl -s "$BACKEND_URL/api/v1/datasets/" | jq -r '.items | sort_by(.created_at) | reverse | .[0].id')

EVAL_NAME="test_experiment_mistral"
EVAL_DESC="Test experiment (inference + eval) with Mistral API"
EVAL_MODEL="mistral://open-mistral-7b"
EVAL_DATASET=$DATASET_ID
EVAL_MAX_SAMPLES="10"

JSON_STRING=$(jq -n \
                --arg name "$EVAL_NAME" \
                --arg desc "$EVAL_DESC" \
                --arg model "$EVAL_MODEL" \
                --arg dataset_id "$EVAL_DATASET" \
                --arg max_samples "$EVAL_MAX_SAMPLES" \
                '{name: $name, description: $desc, model: $model, dataset: $dataset_id, max_samples: $max_samples}' )

echo "Connecting to $BACKEND_URL..."

curl -s "$BACKEND_URL/api/v1/experiments_new/" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d "$JSON_STRING" | jq

I tested with the scripts above (relying on the Mistral API) and ran unit and integration tests locally (make test-all), pulling up the whole system with local-up.

Additional notes for reviewers

Most of the changes involve adding an extra JobType. This will be removed at some point, so I left plenty of comments/TODOs to simplify the future refactoring.

This PR already includes some integration tests on experiments_new, so I am not adding more here.

I have not updated the docs, as this endpoint and service will soon become /experiments; when they do, we'll update the docs for this component directly.

The new /api/v1/jobs/eval_lite/ endpoint can also be tested from the API by passing a config like:

{
  "name": "Testing eval_lite",
  "model": "mistral://open-mistral-7b",
  "dataset": "deaddead-dead-dead-dead-deaddeaddead"
}

where model is a model name that is just added to the output JSON file, and dataset is the ID of a valid dataset that already contains predictions and ground_truth fields (one such dataset is generated by the inference endpoint).
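
For instance, the config above can be posted with curl, mirroring the other endpoints in this PR (a sketch: the POST method and payload shape follow the experiments_new example, BACKEND_URL is set as in the scripts above, and the dataset ID is the placeholder from the config, to be replaced with a real one):

curl -s "$BACKEND_URL/api/v1/jobs/eval_lite/" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Testing eval_lite",
    "model": "mistral://open-mistral-7b",
    "dataset": "deaddead-dead-dead-dead-deaddeaddead"
  }' | jq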

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if so: no DB migration needed here

aittalam and others added 4 commits January 8, 2025 10:42
This commit integrates eval_lite as a separate, new job.
This is done to make sure that, while we add this, everything still works
as expected. Once everything is tested and confirmed to work well, we can
deprecate the older evaluator-based approach in favor of the two
inference + evaluation jobs and remove the extra code.

TODOs have been added to make it easier to find the parts of the code
we need to remove.
@github-actions github-actions bot added the backend, api (Changes which impact API/presentation layer), and schemas (Changes to schemas, which may be public facing) labels on Jan 14, 2025
When max_samples was -1, a bug assigned the value -1 to max_samples
(instead of the dataset size), causing an error in the following `range`
call. This has been fixed by setting max_samples to len(dataset)
whenever max_samples < 1 or max_samples > len(dataset).
@aittalam aittalam marked this pull request as ready for review January 14, 2025 15:03
@veekaybee
Contributor

Tested and working for me via API calls; just a note that make test-all only works if the app is down, not if containers are already running.

Contributor

@veekaybee veekaybee left a comment

LGTM after addressing a few comments; tested and working as spec'd!

@aittalam aittalam merged commit 207d39c into main Jan 15, 2025
9 checks passed
@aittalam aittalam deleted the 568-wire-eval-lite branch January 15, 2025 09:08