Evidence - FeedbackEvaluator Storage & Basic UI #12758

Merged: 55 commits, Feb 5, 2025
55bd451
wip
dandrabik Jan 8, 2025
c391c58
Wip
dandrabik Jan 10, 2025
b16de96
Working version of rule trial.
dandrabik Jan 10, 2025
58c4c2f
Extract examples out of markdown and into code to make easier to edit.
dandrabik Jan 10, 2025
4d6ab0a
Add String extensions to count questions.
dandrabik Jan 13, 2025
5d2c9a6
WIP for regex version of prompt checker.
dandrabik Jan 13, 2025
6c62072
Add some initial regex checks.
dandrabik Jan 15, 2025
971d6a8
wip
dandrabik Jan 17, 2025
a75efd8
Prompt that flags a good number of examples.
dandrabik Jan 17, 2025
071fbd3
More iteration.
dandrabik Jan 17, 2025
96b6148
Remove unused code for now.
dandrabik Jan 17, 2025
48109ad
Extract concern from base class.
dandrabik Jan 17, 2025
611e922
Move more out of the base class.
dandrabik Jan 17, 2025
724178c
Clean up unused files.
dandrabik Jan 19, 2025
c11e6fc
Add new data files.
dandrabik Jan 22, 2025
3b0db29
Update Verbose checker to be looser.
dandrabik Jan 22, 2025
9f24bcd
Add more datasets
dandrabik Jan 22, 2025
804a6ec
Spec wip.
dandrabik Jan 22, 2025
9959c1f
Fix specs, update Scalpel code for more edge cases.
dandrabik Jan 23, 2025
64d12a1
Add some basic tests.
dandrabik Jan 23, 2025
9c6b146
Lint
dandrabik Jan 23, 2025
cd41170
Lint and small refactors.
dandrabik Jan 23, 2025
2d8be04
Rename Scalpel to better name since I’ve edited it a bunch already.
dandrabik Jan 23, 2025
0647b7e
Lint.
dandrabik Jan 23, 2025
26e5123
Don’t modify String from an Engine (rethinking this is bad form).
dandrabik Jan 23, 2025
9fbbe43
Lint.
dandrabik Jan 23, 2025
fae954f
Fix script, delete unused file, lint.
dandrabik Jan 23, 2025
574f4b5
Code Cleanup.
dandrabik Jan 23, 2025
1d9e902
Initial working version of Feedback Evaluation.
dandrabik Jan 30, 2025
79a72e3
Update controller endpoints for frontend use.
dandrabik Jan 30, 2025
77183a5
Merge branch 'develop' into feedback_evaluator_storage
dandrabik Jan 30, 2025
e209a02
Whitespace to trigger build.
dandrabik Jan 31, 2025
9ec3bb7
Add creator spec.
dandrabik Jan 31, 2025
2c0dc02
Add backstop tests.
dandrabik Jan 31, 2025
d2f6308
Self review cleanup.
dandrabik Jan 31, 2025
a896d1e
Clean up schema.
dandrabik Jan 31, 2025
4121cf2
Rename method.
dandrabik Jan 31, 2025
1841341
Merge branch 'develop' into feedback_evaluator_storage
dandrabik Jan 31, 2025
fc2ac78
Working version of error stats on trial index view.
dandrabik Jan 31, 2025
b930449
Working version of showing errors.
dandrabik Jan 31, 2025
3854f19
Add basic flag to llm_example.
dandrabik Jan 31, 2025
c810451
Update snap tests.
dandrabik Feb 3, 2025
e478781
Check out snap from develop.
dandrabik Feb 3, 2025
1c58625
Add feedback errors to old UI.
dandrabik Feb 3, 2025
741031a
PR feedback.
dandrabik Feb 3, 2025
bd619c1
Merge branch 'develop' into feedback_evaluator_storage
dandrabik Feb 3, 2025
f0175fd
Use develops snapshot.
dandrabik Feb 3, 2025
2c9d46a
Fix jest test.
dandrabik Feb 3, 2025
d5f022b
Add another test, update set_evaluator_counts method.
dandrabik Feb 4, 2025
e3e1ac1
Fix test stub with missing created_at attribute.
dandrabik Feb 4, 2025
b9a13bc
Add eager loading to comparison page so it loads.
dandrabik Feb 4, 2025
727ac1f
Fix race condition in update_results with transaction.
dandrabik Feb 5, 2025
fba1fcc
Fixing up tests.
dandrabik Feb 5, 2025
7fb3efd
Add save before with_lock.
dandrabik Feb 5, 2025
f41cc08
Merge branch 'develop' into feedback_evaluator_storage
dandrabik Feb 5, 2025
Working version of error stats on trial index view.
dandrabik committed Jan 31, 2025

commit fc2ac784702d2d27dbe6f41d32120dce3c09bce3
Original file line number Diff line number Diff line change
@@ -104,6 +104,14 @@ const TrialsSection = ({ trials, datasetPath, }: { trials: TrialInterface[], dat
rowSectionClassName: 'center-content allow-wrap',
noTooltip: true
},
+ {
+ name: 'Feedback Error Rate',
+ attribute: 'feedbackErrorRate',
+ width: '64px',
+ headerClassName: 'center-content',
+ rowSectionClassName: 'center-content allow-wrap',
+ noTooltip: true
+ },
{
name: 'LLM',
attribute: 'llmVersion',
@@ -151,7 +159,7 @@ const TrialsSection = ({ trials, datasetPath, }: { trials: TrialInterface[], dat
]

const rows = () => trials.map(trial => {
- const { number, created_at, temperature, optimal_correct, optimal_count, suboptimal_correct, suboptimal_count, average_g_eval_score, status, id, notes, llm_version, llm_prompt, } = trial
+ const { number, created_at, temperature, optimal_correct, optimal_count, suboptimal_correct, suboptimal_count, average_g_eval_score, status, id, notes, llm_version, llm_prompt, evaluator_failure_count, evaluator_total_count } = trial
const { name, optimal_examples_count, suboptimal_examples_count, guidelines_count, } = llm_prompt

let compareCheckbox = <button aria-label="Unchecked checkbox" className="quill-checkbox unselected" onClick={() => toggleTrialSelection(id)} type="button" />
@@ -173,6 +181,7 @@ const TrialsSection = ({ trials, datasetPath, }: { trials: TrialInterface[], dat
guidelinesCount: guidelines_count,
optimalAccuracy: percentAccuracy(optimal_correct, optimal_count),
suboptimalAccuracy: percentAccuracy(suboptimal_correct, suboptimal_count),
+ feedbackErrorRate: evaluator_failure_count === null ? null : percentAccuracy(evaluator_failure_count, evaluator_total_count),
llmVersion: llm_version,
averageGEvalScore: average_g_eval_score,
notes,
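Review note: the new `feedbackErrorRate` row value reuses `percentAccuracy`, whose implementation is not part of this diff, and guards only against `null`. The following is a minimal sketch of the same calculation, not the code in this PR. The helper name `formatFeedbackErrorRate` and the one-decimal output format are assumptions made for illustration. It also treats `undefined` and a zero total as "no data", which the diff does not do explicitly.

```typescript
// Hypothetical helper (not from this PR): formats evaluator failures as a
// percentage string, or returns null when there is nothing to report.
function formatFeedbackErrorRate(
  failureCount: number | null | undefined,
  totalCount: number | null | undefined
): string | null {
  // `== null` matches both null and undefined, so a missing field and a
  // JSON null are handled the same way.
  if (failureCount == null || totalCount == null) return null;

  // Avoid dividing by zero when no examples were evaluated.
  if (totalCount === 0) return null;

  const percent = (failureCount / totalCount) * 100;
  return `${percent.toFixed(1)}%`;
}

// Example: 3 failures out of 12 evaluated examples.
console.log(formatFeedbackErrorRate(3, 12));
```

Returning `null` rather than `'0.0%'` for trials without evaluator data would keep the column empty for older trials instead of implying a zero error rate.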
Original file line number Diff line number Diff line change
@@ -211,6 +211,8 @@ export interface TrialInterface {
optimal_count: number;
suboptimal_correct: number;
suboptimal_count: number;
+ evaluator_failure_count?: number;
+ evaluator_total_count?: number;
average_g_eval_score: number;
llm_version: number;
llm_prompt?: LLMPromptInterface;
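Review note: the two new fields are declared as optional (`?: number`), which under strict null checks allows `undefined` but not `null`, while the component compares against `null`. A possible way to make the type and the check agree is sketched below. `EvaluatorCounts` and `hasEvaluatorCounts` are hypothetical names, and the `number | null` typing rests on the assumption that the backend serializes a missing count as JSON `null`.

```typescript
// Hypothetical subset of TrialInterface, for illustration only.
interface EvaluatorCounts {
  evaluator_failure_count?: number | null;
  evaluator_total_count?: number | null;
}

// Type guard: returns true only when both counts are present and non-null,
// and narrows them to plain numbers for the caller.
function hasEvaluatorCounts(
  trial: EvaluatorCounts
): trial is { evaluator_failure_count: number; evaluator_total_count: number } {
  return trial.evaluator_failure_count != null && trial.evaluator_total_count != null;
}

// Example: a trial created before evaluator counts existed.
const legacyTrial: EvaluatorCounts = { evaluator_failure_count: null, evaluator_total_count: null };
console.log(hasEvaluatorCounts(legacyTrial));
```

With a guard like this, the row builder could branch once and use both counts as numbers without further null checks.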
Original file line number Diff line number Diff line change
@@ -85,7 +85,7 @@ def serializable_hash(options = nil)
options ||= {}
super(options.reverse_merge(
include: [:llm_prompt],
- methods: [:average_g_eval_score, :optimal_correct, :optimal_count, :suboptimal_correct, :suboptimal_count, :llm_version, :vendor, :test_examples_count]
+ methods: [:average_g_eval_score, :optimal_correct, :optimal_count, :suboptimal_correct, :suboptimal_count, :llm_version, :vendor, :test_examples_count, :evaluator_failure_count, :evaluator_total_count]
))
end

Original file line number Diff line number Diff line change
@@ -88,6 +88,7 @@
<th>Optimal Accuracy</th>
<th>Suboptimal Accuracy</th>
<th>Weighted Accuracy</th>
+ <th>Feedback Error Rate</th>
<th>Model</th>
<% if @dataset.generative? %>
<th>GEval Average</th>
@@ -119,6 +120,7 @@
<td><%= percent_accuracy(trial.optimal_correct, trial.optimal_count) %></td>
<td><%= percent_accuracy(trial.suboptimal_correct, trial.suboptimal_count) %></td>
<td><%= trial.weighted_accuracy&.round(5) %></td>
+ <td><%= percent_accuracy(trial.evaluator_failure_count, trial.evaluator_total_count) if trial.evaluator_failure_count %></td>
<td><%= trial.llm.version %></td>
<% if @dataset.generative? %>
<td><%= trial.average_g_eval_score %></td>