New prompts should be tested to evaluate their performance and minimise unexpected issues in production. This will likely involve accumulating generated test datasets targeting different issues, as well as using LLM-based evaluation to check if each test passed (T/F) to produce a score.
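One way to turn that into a score is an LLM-as-judge loop: each test case pairs a question with a plain-English pass criterion, a judge model returns pass/fail for each response, and the pass rate becomes the score. A minimal sketch, assuming an `assistant` and a `judge` callable that wrap whatever model API we settle on (the case format and function names here are placeholders, not anything in the repo yet):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TestCase:
    question: str   # prompt fed to the assistant under test
    criterion: str  # plain-English pass condition given to the judge


def evaluate(cases: list[TestCase],
             assistant: Callable[[str], str],
             judge: Callable[[str], str]) -> float:
    """Run each question through the assistant, ask the judge whether the
    response meets the criterion, and return the fraction of passes."""
    passed = 0
    for case in cases:
        answer = assistant(case.question)
        verdict = judge(
            "Does the RESPONSE satisfy the CRITERION? Reply with exactly "
            f"PASS or FAIL.\n\nCRITERION: {case.criterion}\n\nRESPONSE: {answer}"
        )
        if verdict.strip().upper().startswith("PASS"):
            passed += 1
    return passed / len(cases) if cases else 0.0
```

If the judge wraps the same provider as the assistant, it is probably worth pinning a specific model version for grading so the score itself doesn't drift between runs.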
For the record I would be happy with a manual test process which goes something like this (rough sketch after the list):

1. Run a bunch of test questions through the assistant
2. Save the responses to a file and check them into the repo
3. Make a change to the prompt
4. Re-run the test questions and MANUALLY review diffs in the answers
5. Check in the updated answers if we're happy
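A rough sketch of that loop, assuming a `questions.txt` / `answers.txt` snapshot layout and an `ask_assistant` function wrapping the real call (the paths and names are hypothetical):

```python
import difflib
from pathlib import Path
from typing import Callable

QUESTIONS = Path("tests/prompt_snapshots/questions.txt")
ANSWERS = Path("tests/prompt_snapshots/answers.txt")  # checked into the repo
SEPARATOR = "\n---\n"


def run_suite(ask_assistant: Callable[[str], str]) -> str:
    """Run every test question through the assistant and join the answers."""
    questions = QUESTIONS.read_text().splitlines()
    return SEPARATOR.join(ask_assistant(q) for q in questions if q.strip())


def diff_against_snapshot(new_answers: str) -> str:
    """Return a unified diff between the committed answers and the new run."""
    old = ANSWERS.read_text().splitlines(keepends=True)
    new = new_answers.splitlines(keepends=True)
    return "".join(difflib.unified_diff(old, new, "committed", "current"))


if __name__ == "__main__":
    answers = run_suite(ask_assistant=lambda q: q)  # TODO: wire up the real assistant
    print(diff_against_snapshot(answers) or "No changes")
    # If the diff looks right, overwrite the snapshot and commit it:
    # ANSWERS.write_text(answers)
```

Checking `answers.txt` into the repo means a prompt change shows up as a reviewable diff in the PR itself.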
We may also need to factor in drift from the LLM end itself: as e.g. Anthropic updates its models, I don't know how tightly we can version-lock, so we may see some natural variance in answers even without prompt changes.