Eval config eval #226
Closed
Conversation
…l framework. Not up and running yet
…s to the EvalRun but working with tests for now.
I need to re-read the paper to check my math, but this is the right framework. Lots of tests because of all the potential edge cases. I've already seen some cool results averaging several values (though GPT-4o mini is quite certain much of the time); see the weighted-score sketch below.
G-eval and LLM as judge evals
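A minimal sketch of the logprob-weighted scoring idea described above (a G-Eval-style probability-weighted average over candidate score tokens). The function name, 1-5 score range, and input shape are illustrative assumptions, not the PR's actual datamodel.

```python
import math

def weighted_score(top_logprobs: dict[str, float]) -> float:
    """Probability-weighted average over candidate score tokens (1-5).

    top_logprobs maps token -> logprob at the position where the judge
    emits its score. Hypothetical shape; the real datamodel may differ.
    """
    total_prob = 0.0
    weighted_sum = 0.0
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token in {"1", "2", "3", "4", "5"}:
            p = math.exp(logprob)
            total_prob += p
            weighted_sum += p * int(token)
    if total_prob == 0.0:
        raise ValueError("No score tokens found in top_logprobs")
    return weighted_sum / total_prob

# Example: the judge is fairly certain the answer is a 4.
print(weighted_score({"4": -0.11, "5": -2.3, "3": -4.6}))  # ~4.09
```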
- Use our prompt_ids everywhere!
- Make a new RunConfig which contains all info about running a model (see the sketch below)
- Use RunConfig everywhere
Run config and prompt ID usage across codebase
- PromptID into the datamodel.
- RunConfig into a task.
Add tests and refactor TaskRunConfig. Move Prompt ID into datamodel.
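A rough sketch of the RunConfig idea from the bullets above: one object carrying everything needed to run a model, attached to a task as a TaskRunConfig. Field names here are illustrative assumptions, not the PR's exact datamodel.

```python
from pydantic import BaseModel, ConfigDict

class RunConfig(BaseModel):
    # Everything needed to run a model, referenced everywhere by prompt_id.
    model_config = ConfigDict(protected_namespaces=())  # allow model_* field names
    model_name: str
    model_provider: str
    prompt_id: str
    temperature: float = 1.0

class TaskRunConfig(BaseModel):
    # Attached to a task so eval runs are reproducible.
    name: str
    run_config: RunConfig
```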
Add a tag-based dataset filter.
Evals run model, and Dataset filter pydantic type
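A small sketch of what the tag-based dataset filter could look like, assuming the pydantic-typed filter mentioned above is essentially a predicate over task runs; the names are illustrative.

```python
from typing import Callable, List
from pydantic import BaseModel

class TaskRun(BaseModel):
    id: str
    tags: List[str] = []

# A dataset filter is just a predicate over task runs.
DatasetFilter = Callable[[TaskRun], bool]

def tag_filter(tag: str) -> DatasetFilter:
    """Keep only task runs carrying the given tag (e.g. 'eval_set')."""
    return lambda run: tag in run.tags

runs = [TaskRun(id="1", tags=["eval_set"]), TaskRun(id="2", tags=["golden"])]
eval_set = [r for r in runs if tag_filter("eval_set")(r)]  # -> just run "1"
```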
- 1 for evaluating the eval configs, which needs ratings
- 1 for running the eval
Eval results
Nicer evals UI for table
- Freeze non-frozen prompts into the task_run, so the evals are consistent
- Expose frozen prompts via the prompt UI
Much better prompt system for evals
- Want the evaluator to have some context on what the goal is.
- Don't want to give it the prompt: we're testing prompts, so that would bias the evaluator.
- Instead, give a short task description, which is locked across the eval_config, so there's no bias toward a given prompt (see the sketch below).
Okay: fix the root of my two-prompt issue
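A hedged sketch of the fix above: the judge is built from a short task description stored on the eval_config, never from the prompt under test. The function and wording are illustrative assumptions, not the PR's API.

```python
def build_judge_instructions(task_description: str, model_output: str) -> str:
    # The task description is locked across the eval_config, so every
    # prompt under test is judged against the same context.
    return (
        "You are evaluating output produced for the following task:\n"
        f"{task_description}\n\n"
        "Output to evaluate:\n"
        f"{model_output}\n\n"
        "Rate the output from 1 (poor) to 5 (excellent). Reply with a single digit."
    )
```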
Add tests. Note: ran a lot more; these are the only ones that work. Fireworks only returns 5 logprobs (not enough). Ollama doesn't support logprobs. Amazon could work, but that can come later. Note: slightly ugly provider-specific code leaking into the OAI-compatible adapter. Okay for now, but we should limit this.
…as well as the cross product of eval_configs and task runs.
- Tests were failing on CI because of how we checked the provider. Just use the name now.
- Don't specify extra OR parameters unless needed for logprobs (see the sketch below).
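For reference, a sketch of requesting logprobs through an OpenAI-compatible endpoint, the one case where the extra parameters above are passed. Provider behavior varies (Fireworks caps top_logprobs, Ollama returns none); the model name and client setup here are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # or any OpenAI-compatible base_url/api_key

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Rate this answer from 1 to 5: ..."}],
    logprobs=True,
    top_logprobs=10,  # only sent when the judge actually needs logprobs
)

# Top candidate tokens (and their logprobs) for the first generated token.
first_token = completion.choices[0].logprobs.content[0]
candidates = {t.token: t.logprob for t in first_token.top_logprobs}
```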
… your score. Includes the ability to run the eval-config-eval.
- Improve strings/messaging
- Allow creating eval configs from /eval_configs with correct redirect
- Fix a bug where eval runs without task_run_configs were causing lookup errors.
…g, which took the whole Svelte component out of the DOM
Created for incorrect branch. See #228
New system to evaluate the evaluators against human baselines.
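A minimal sketch of one way to score an eval config against the human baseline: mean absolute error between judge scores and human ratings. This is an assumed metric for illustration; the PR may use a different one.

```python
from statistics import mean

def mean_abs_error(judge_scores: list[float], human_ratings: list[float]) -> float:
    # Lower is better: the judge tracks the human baseline more closely.
    return mean(abs(j - h) for j, h in zip(judge_scores, human_ratings, strict=True))

print(mean_abs_error([4.1, 2.9, 5.0], [4, 3, 4]))  # 0.4
```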