
568 wire eval lite #616

Merged
merged 10 commits into main from 568-wire-eval-lite on Jan 15, 2025
Conversation

aittalam
Member

@aittalam aittalam commented Jan 14, 2025

What's changing

Wires the new eval_lite job into the experiments_new workflow. As a result, our new experiments now hit an inference job first, followed by a "light" evaluation job that calculates evaluation metrics from the provided ground_truth and predictions.

The PR includes a few fixes to eval_lite (tested locally), dependency cleanup, and a refactoring of the existing code to add a light evaluation JobType, so that we can treat it separately from the existing evaluation and keep both working for some time (until we are confident in the new eval and choose to remove the evaluator code).

Refs #568

How to test it

You can directly hit the experiments_new endpoint by passing a dataset that already has both original samples and ground truth. Here are two scripts you can run to upload a dataset (dialogsum.csv here) and run the multi-job experiment:

test_upload_dataset

#!/bin/bash
# Upload a CSV dataset (dialogsum.csv by default) to the backend.
if [ "$#" -gt 0 ]; then
    DATA_CSV_PATH="$1"
else
    DATA_CSV_PATH="$HOME/Downloads/dialogsum.csv"
fi

if [[ -z "${BACKEND_URL}" ]]; then
  BACKEND_URL=http://localhost:8000
fi

echo "Connecting to $BACKEND_URL..."

curl -s "$BACKEND_URL/api/v1/datasets/" \
  -H 'Accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'dataset=@'"$DATA_CSV_PATH"';type=text/csv' \
  -F 'format=job' | jq
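
For reference, assuming the snippet above is saved as an executable test_upload_dataset.sh (the filename is an assumption based on the heading), you can invoke it with an optional CSV path:

./test_upload_dataset.sh ~/Downloads/dialogsum.csv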

test_experiment_mistral

#!/bin/bash
# Launch a multi-job (inference + eval) experiment on the most recently uploaded dataset.
if [[ -z "${BACKEND_URL}" ]]; then
  BACKEND_URL=http://localhost:8000
fi

DATASET_ID=$(curl -s "$BACKEND_URL/api/v1/datasets/" | jq -r '.items | sort_by(.created_at) | reverse | .[0].id')

EVAL_NAME="test_experiment_mistral"
EVAL_DESC="Test experiment (inference + eval) with Mistral API"
EVAL_MODEL="mistral://open-mistral-7b"
EVAL_DATASET=$DATASET_ID
EVAL_MAX_SAMPLES="10"

JSON_STRING=$(jq -n \
                --arg name "$EVAL_NAME" \
                --arg desc "$EVAL_DESC" \
                --arg model "$EVAL_MODEL" \
                --arg dataset_id "$EVAL_DATASET" \
                --arg max_samples "$EVAL_MAX_SAMPLES" \
                '{name: $name, description: $desc, model: $model, dataset: $dataset_id, max_samples: $max_samples}' )

echo "Connecting to $BACKEND_URL..."

curl -s "$BACKEND_URL/api/v1/experiments_new/" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d "$JSON_STRING" | jq

I tested with the scripts above (relying on the Mistral API) and ran unit and integration tests locally (make test-all), pulling up the whole system with local-up.

Additional notes for reviewers

Most of the changes involve adding an extra JobType. This will be removed at some point, so I left plenty of comments/TODOs to simplify the future refactoring.

This PR already includes some integration tests on experiments_new, so I am not adding more here.

I have not updated the docs, as this endpoint and service will soon become /experiments; when they do, we'll update the docs for this component directly.

The new /api/v1/jobs/eval_lite/ endpoint can also be tested from the API by passing a config like:

{
  "name": "Testing eval_lite",
  "model": "mistral://open-mistral-7b",
  "dataset": "deaddead-dead-dead-dead-deaddeaddead"
}

where model is a model name that is just added to the output JSON file, and dataset is the ID of a valid dataset that already contains predictions and ground_truth fields (one such dataset is generated by the inference endpoint).
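
For instance, the config above can be posted with curl, mirroring the other endpoints in this PR (a sketch: the POST method and payload shape follow the experiments_new example, BACKEND_URL is set as in the scripts above, and the dataset ID is the placeholder from the config, to be replaced with a real one):

curl -s "$BACKEND_URL/api/v1/jobs/eval_lite/" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Testing eval_lite",
    "model": "mistral://open-mistral-7b",
    "dataset": "deaddead-dead-dead-dead-deaddeaddead"
  }' | jq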

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if so: no DB migration needed here

aittalam and others added 4 commits January 8, 2025 10:42
This commit integrates eval_lite as a separate, new job.
This is done to make sure that, while we add this, everything still works
as expected. Once everything is tested and confirmed to work well, we can
deprecate the older evaluator-based approach in favor of the two
inference + evaluation jobs and remove the extra code.

TODOs have been added to make it easier to find the parts of the code
we need to remove.
@github-actions github-actions bot added the backend, api (Changes which impact API/presentation layer), and schemas (Changes to schemas, which may be public facing) labels on Jan 14, 2025
When max_samples was -1, a bug assigned the value -1 to max_samples
(instead of the dataset size), causing an error in the following `range`
call. This has been fixed by setting max_samples to len(dataset)
whenever max_samples < 1 or max_samples > len(dataset).
@aittalam aittalam marked this pull request as ready for review January 14, 2025 15:03
@veekaybee
Contributor

Tested and working for me via API calls; just a note that make test-all only works if the app is down, not if containers are already running.

Contributor

@veekaybee veekaybee left a comment

LGTM after addressing a few comments; tested and working as spec'd!

@aittalam aittalam merged commit 207d39c into main Jan 15, 2025
9 checks passed
@aittalam aittalam deleted the 568-wire-eval-lite branch January 15, 2025 09:08