568 wire eval lite #616
Merged
Conversation
This commit integrates eval_lite as a separate, new job. This is done to make sure that everything still works as expected while we add it. Once everything is tested and in good shape, we can deprecate the older evaluator-based approach in favor of the two inference + evaluation jobs and remove the extra code. TODOs have been added to make the code we need to remove easier to find.
When max_samples was -1, a bug left max_samples at -1 (instead of setting it to the dataset size), causing an error in the subsequent `range` call. This has been fixed by setting max_samples to len(dataset) whenever max_samples < 1 or max_samples > len(dataset).
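A minimal sketch of the clamping logic the fix describes; the function name and signature here are illustrative, not the actual code in the repository.

```python
def resolve_max_samples(max_samples: int, dataset_size: int) -> int:
    # Clamp to the dataset size when the requested value is unset (-1)
    # or out of range, so the subsequent range() call is always valid.
    if max_samples < 1 or max_samples > dataset_size:
        return dataset_size
    return max_samples
```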
…into 568-wire-eval-lite
veekaybee
reviewed
Jan 14, 2025
Tested and working for me via API calls. One note: make test-all only works if the app is down, not if the containers are already running.
veekaybee
approved these changes
Jan 14, 2025
LGTM after addressing a few comments; tested and working as specced!
javiermtorres
approved these changes
Jan 15, 2025
What's changing
Wires the new eval_lite job into the experiments_new workflow. As a result, our new experiments now hit an inference job first, followed by a "light" evaluation job which calculates evaluation metrics from the provided ground_truth and predictions.
The PR includes a few fixes to eval_lite (tested locally), dependency cleanup, and a refactor of existing code to add a light-evaluation JobType, so that we can treat it separately from the existing evaluation and keep both working for some time (until we are confident in the new eval and choose to remove the evaluator code).
Refs #568
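The extra JobType mentioned above could be sketched as follows. This is an illustrative assumption only: the actual enum name, members, and values in the codebase may differ.

```python
from enum import Enum

class JobType(str, Enum):
    INFERENCE = "inference"
    EVALUATION = "evaluation"       # existing evaluator-based job
    EVALUATION_LITE = "eval_lite"   # TODO: remove once the old evaluator path is retired
```

Keeping both members lets dispatch code branch on the job type while the two evaluation paths coexist.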
How to test it
You can directly hit the evaluation_new endpoint by passing a dataset which already contains both original samples and ground truth. Here are two scripts you can run to upload a dataset (dialogsum.csv here) and run the multi-job experiment:
test_upload_dataset
test_experiment_mistral
I tested with the above scripts (relying on the Mistral API) and ran unit + integration tests locally (make test-all) after pulling up the whole system with local-up.
Additional notes for reviewers
Most of the changes involve adding an extra JobType. This will be removed at some point, so I left plenty of comments / TODOs to simplify the later cleanup.
This PR already includes some integration tests on experiments_new, so I am not adding more here.
I have not updated the docs, as this endpoint + service are going to become /experiments soon; when they do, we'll directly update the docs for this component.
The new /api/v1/jobs/eval_lite/ endpoint can also be tested from the API by passing a config where model is a model name which is just added to the output JSON file, and dataset is the ID of a valid dataset which already contains predictions and ground_truth fields (one such dataset is generated by the inference endpoint).
I already... (/docs)
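A hypothetical shape for such an eval_lite job config, based only on the fields described above; the field names and structure are assumptions, not the actual schema.

```python
# Illustrative config for POST /api/v1/jobs/eval_lite/ (hypothetical schema).
config = {
    "name": "eval-lite-test",
    "model": "mistral-7b",       # only recorded in the output JSON file
    "dataset": "<dataset-id>",   # dataset must contain predictions and ground_truth
    "max_samples": -1,           # -1 means "use the whole dataset"
}
```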