Eval config eval #226

Closed
wants to merge 81 commits into from

Conversation

scosman (Collaborator) commented Feb 26, 2025

New system to evaluate the evaluators against human baselines.

…s to the EvalRun but working with tests for now.
I need to re-read the paper to check my math, but this is the right framework.

Lots of tests because of all the potential edge cases. I've already seen some cool results averaging several values (but GPT-4o mini is certain quite a bit of the time).
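For context, this kind of averaging is the G-Eval-style trick of weighting each candidate score by the probability the judge assigns its token, rather than taking a single sampled score. A minimal sketch of that math, assuming a 1-5 integer scale and a dict of top-token logprobs (function name and scale are illustrative, not the PR's code):

```python
import math

def weighted_score(top_logprobs: dict[str, float]) -> float:
    """Probability-weighted average of the judge's candidate score tokens."""
    total_prob = 0.0
    weighted_sum = 0.0
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if not token.isdigit():
            continue  # skip non-score tokens in the top-k list
        prob = math.exp(logprob)
        weighted_sum += int(token) * prob
        total_prob += prob
    if total_prob == 0.0:
        raise ValueError("no numeric score tokens found in top_logprobs")
    return weighted_sum / total_prob  # renormalize over the scores we actually saw

# Example: a judge that is mostly sure the answer is a 4, with some mass on 5 and 3
print(weighted_score({"4": -0.22, "5": -1.90, "3": -3.50}))  # ≈ 4.1
```

If the model is extremely certain (one token carries nearly all the probability mass), the weighted average collapses to the single sampled score, which is the caveat noted above.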
G-eval and LLM as judge evals
 - Use our prompt_ids everywhere!
 - Make a new RunConfig which contains all info about running a model (rough sketch below)
 - Use RunConfig everywhere
Run config and prompt ID usage across codebase
- PromptID into the datamodel.
- RunConfig into a task
Add tests and refactor TaskRunConfig. Move Prompt ID into datamodel.
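A rough sketch of what a RunConfig like the one described in the commits above could look like as a Pydantic model. The field names here are assumptions for illustration, not the PR's actual datamodel:

```python
from pydantic import BaseModel, ConfigDict

class RunConfig(BaseModel):
    """Everything needed to run a model against a task."""
    model_config = ConfigDict(protected_namespaces=())  # allow model_* field names
    model_name: str            # e.g. "gpt-4o-mini"
    model_provider_name: str   # e.g. "openai", "openrouter"
    prompt_id: str             # the PromptID now living in the datamodel
    temperature: float = 1.0

# A TaskRunConfig could then wrap a RunConfig plus the task it applies to.
```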
Evals run model, and Dataset filter pydantic type (rough sketch below)
 - 1 for evaluating the eval configs, which needs ratings
 - 1 for running the eval
 - Freeze non-frozen prompts into the task_run, so the evals are consistent
 - Expose frozen prompts via prompt UI
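A sketch of the two dataset filters described above as a Pydantic type. The class and field names are assumptions, just to show the shape:

```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel

class EvalDatasetPurpose(str, Enum):
    CONFIG_EVAL = "config_eval"  # needs human ratings: used to score the eval configs
    EVAL = "eval"                # plain task runs: used to run the eval itself

class DatasetFilter(BaseModel):
    purpose: EvalDatasetPurpose
    requires_rating: bool = False  # True for the config-eval filter
    tag: Optional[str] = None      # optionally restrict to task runs carrying this tag
```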
Much better prompt system for evals
 - Want the evaluator to have some context on what the goal is.
 - Don't want to give it the prompt: we're testing prompts, and that would bias the evaluator
 - Instead, give a short task description, which is locked across the eval_config, so there's no bias for a given prompt (rough sketch below)
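A sketch of the idea: the judge gets a short, fixed task description (stored on the eval_config), never the prompt under test, so no candidate prompt gets an edge. Function and parameter names are illustrative:

```python
def build_judge_instructions(task_description: str, eval_steps: list[str]) -> str:
    """Build the evaluator's instructions from the locked task description."""
    steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(eval_steps))
    return (
        "The model under evaluation was asked to perform this task:\n"
        f"{task_description}\n\n"
        "Evaluate its output using these steps, then give a 1-5 score:\n"
        f"{steps}"
    )
```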
Okay: fix the root of my 2 prompt issue
Add tests.

Note: I ran a lot more models; these are the only ones that work. Fireworks only returns 5 logprobs (not enough). Ollama doesn't support logprobs. Amazon could work, but we can do that later.
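For reference, this is roughly how top logprobs are requested through an OpenAI-compatible chat API; provider support varies exactly as noted above. The model name and top_logprobs count are examples:

```python
from openai import OpenAI

client = OpenAI()  # or point base_url at another OpenAI-compatible provider

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Score the answer 1-5. Reply with one digit."}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=10,  # Fireworks caps this at 5; Ollama doesn't return logprobs at all
)

# Top candidate tokens (and their logprobs) for the score position.
for candidate in resp.choices[0].logprobs.content[0].top_logprobs:
    print(candidate.token, candidate.logprob)
```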

Note: slightly ugly provider-specific code is leaking into the OAI-compatible adapter. Okay for now, but we should limit this.
…as well as the cross product of eval_configs and task runs.
 - Tests were failing on CI because of how we checked the provider. Just use the name now.
 - Don't specify extra OR parameters, unless needed for logprobs
… your score. Includes the ability to run the eval-config-eval.
- Improve strings/messaging
- Allow creating eval configs from /eval_configs with correct redirect
- Fix a bug where eval runs without task_run_configs were causing lookup errors.
…g, which took the whole Svelte component out of the DOM
scosman closed this Feb 26, 2025
scosman (Collaborator, Author) commented Feb 26, 2025

Created for the incorrect branch. See #228.

scosman deleted the eval_config_eval branch February 26, 2025 19:55