Eval config eval #226
Closed
Conversation
…l framework. Not up and running yet
…s to the EvalRun but working with tests for now.
I need to re-read the paper to check my math, but this is the right framework. Lots of tests because of all the potential edge cases. I've already seen some cool results averaging several values (though GPT-4o mini is quite certain much of the time); see the weighted-score sketch below.
G-eval and LLM as judge evals
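A minimal sketch of the logprob-weighted scoring idea described above (a G-Eval-style probability-weighted average over candidate score tokens). The function name, 1-5 score range, and input shape are illustrative assumptions, not the PR's actual datamodel.

```python
import math

def weighted_score(top_logprobs: dict[str, float]) -> float:
    """Probability-weighted average over candidate score tokens (1-5).

    top_logprobs maps token -> logprob at the position where the judge
    emits its score. Hypothetical shape; the real datamodel may differ.
    """
    total_prob = 0.0
    weighted_sum = 0.0
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token in {"1", "2", "3", "4", "5"}:
            p = math.exp(logprob)
            total_prob += p
            weighted_sum += p * int(token)
    if total_prob == 0.0:
        raise ValueError("No score tokens found in top_logprobs")
    return weighted_sum / total_prob

# Example: the judge is fairly certain the answer is a 4.
print(weighted_score({"4": -0.11, "5": -2.3, "3": -4.6}))  # ~4.09
```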
- Use our prompt_ids everywhere!
- Make a new RunConfig which contains all info about running a model (see the sketch below)
- Use RunConfig everywhere
Run config and prompt ID usage across codebase
- PromptID into the datamodel.
- RunConfig into a task.
Add tests and refactor TaskRunConfig. Move Prompt ID into datamodel.
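A rough sketch of the RunConfig idea from the bullets above: one object carrying everything needed to run a model, attached to a task as a TaskRunConfig. Field names here are illustrative assumptions, not the PR's exact datamodel.

```python
from pydantic import BaseModel, ConfigDict

class RunConfig(BaseModel):
    # Everything needed to run a model, referenced everywhere by prompt_id.
    model_config = ConfigDict(protected_namespaces=())  # allow model_* field names
    model_name: str
    model_provider: str
    prompt_id: str
    temperature: float = 1.0

class TaskRunConfig(BaseModel):
    # Attached to a task so eval runs are reproducible.
    name: str
    run_config: RunConfig
```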
Add a tag-based dataset filter.
Evals run model, and Dataset filter pydantic type
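A small sketch of what the tag-based dataset filter could look like, assuming the pydantic-typed filter mentioned above is essentially a predicate over task runs; the names are illustrative.

```python
from typing import Callable, List
from pydantic import BaseModel

class TaskRun(BaseModel):
    id: str
    tags: List[str] = []

# A dataset filter is just a predicate over task runs.
DatasetFilter = Callable[[TaskRun], bool]

def tag_filter(tag: str) -> DatasetFilter:
    """Keep only task runs carrying the given tag (e.g. 'eval_set')."""
    return lambda run: tag in run.tags

runs = [TaskRun(id="1", tags=["eval_set"]), TaskRun(id="2", tags=["golden"])]
eval_set = [r for r in runs if tag_filter("eval_set")(r)]  # -> just run "1"
```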
- 1 for evaluating the eval configs, which needs ratings
- 1 for running the eval
Eval results
Nicer evals UI for table
- Freeze non-frozen prompts into the task_run, so the evals are consistent
- Expose frozen prompts via the prompt UI
Much better prompt system for evals
- Want the evaluator to have some context on what the goal is.
- Don't want to give it the prompt: we're testing prompts, so that would bias the evaluator.
- Instead, give a short task description, which is locked across the eval_config, so there's no bias toward a given prompt (see the sketch below).
Okay: fix the root of my two-prompt issue
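A hedged sketch of the fix above: the judge is built from a short task description stored on the eval_config, never from the prompt under test. The function and wording are illustrative assumptions, not the PR's API.

```python
def build_judge_instructions(task_description: str, model_output: str) -> str:
    # The task description is locked across the eval_config, so every
    # prompt under test is judged against the same context.
    return (
        "You are evaluating output produced for the following task:\n"
        f"{task_description}\n\n"
        "Output to evaluate:\n"
        f"{model_output}\n\n"
        "Rate the output from 1 (poor) to 5 (excellent). Reply with a single digit."
    )
```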
Add tests. Note: ran a lot more; these are the only ones that work. Fireworks only returns 5 logprobs (not enough). Ollama doesn't support logprobs. Amazon could work, but that can come later. Note: slightly ugly provider-specific code leaking into the OAI-compatible adapter. Okay for now, but we should limit this.
…as well as the cross product of eval_configs and task runs.
- Tests were failing on CI because of how we checked the provider. Just use the name now.
- Don't specify extra OR parameters unless needed for logprobs (see the sketch below).
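For reference, a sketch of requesting logprobs through an OpenAI-compatible endpoint, the one case where the extra parameters above are passed. Provider behavior varies (Fireworks caps top_logprobs, Ollama returns none); the model name and client setup here are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # or any OpenAI-compatible base_url/api_key

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Rate this answer from 1 to 5: ..."}],
    logprobs=True,
    top_logprobs=10,  # only sent when the judge actually needs logprobs
)

# Top candidate tokens (and their logprobs) for the first generated token.
first_token = completion.choices[0].logprobs.content[0]
candidates = {t.token: t.logprob for t in first_token.top_logprobs}
```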
… your score. Includes the ability to run the eval-config-eval.
- Improve strings/messaging
- Allow creating eval configs from /eval_configs with correct redirect
- Fix a bug where eval runs without task_run_configs were causing lookup errors.
…g, which took the whole Svelte component out of the DOM
Created for incorrect branch. See #228
New system to evaluate the evaluators against human baselines.
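A minimal sketch of one way to score an eval config against the human baseline: mean absolute error between judge scores and human ratings. This is an assumed metric for illustration; the PR may use a different one.

```python
from statistics import mean

def mean_abs_error(judge_scores: list[float], human_ratings: list[float]) -> float:
    # Lower is better: the judge tracks the human baseline more closely.
    return mean(abs(j - h) for j, h in zip(judge_scores, human_ratings, strict=True))

print(mean_abs_error([4.1, 2.9, 5.0], [4, 3, 4]))  # 0.4
```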