Humans can annotate the results (e.g. Good, Neutral, Bad).
See https://doccano.github.io/doccano/ for installation.
- Create a new project, e.g. "Area Summary".
- Create the tags, e.g. "Good", "Bad", "Neutral".
- We want these options for the project:
  - Allow single label
  - Randomize document order
  - Share annotations across users
We want the data in the following format for doccano:
[
{
"context": "Terrible customer service.",
"response": ["negative"]
},
{
"context": "Really great transaction.",
"response": ["positive"]
},
...
]
Here is an example of how to convert an unlabeled dataset for import into doccano. This creates a single input file from all model outputs merged together. The results are randomized in doccano and can be rated blind:
cat model_outputs/area_summary/*/*.yaml | python script/import.py > model_outputs/area_summary/dataset.json
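The conversion above can be sketched roughly as follows. This is a hypothetical outline, not the actual `script/import.py`: the `build_dataset` name, the input record shape, and the empty `response` field are assumptions based on the format shown earlier (the real script reads YAML from stdin, e.g. via `yaml.safe_load_all`).

```python
import json
import random

def build_dataset(records):
    """Merge model-output records into doccano's import format.

    The list is shuffled so annotators rate each output blind,
    without knowing which model produced it."""
    dataset = [{"context": r["context"], "response": []} for r in records]
    random.shuffle(dataset)
    return dataset

# Example: two fake model outputs (hypothetical field names).
outputs = [{"context": "Terrible customer service."},
           {"context": "Really great transaction."}]
print(json.dumps(build_dataset(outputs), indent=2))
```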
Create labels of Good, Bad, and Neutral, and export the dataset when done labeling.
Convert the labeled JSONL output into YAML:
$ cat model_outputs/area_summary_agent/annotations.jsonl | python script/export.py > model_outputs/area_summary_agent/annotations.yaml
The output file contains the labeled task results which can be joined with the original model output and used for computing metrics:
---
uuid: e0fee440-e99d-4ecf-8dc9-8f686e51db3e
task_id: apartament4-bedroom-1-bedroom-1-light-on
label:
- Good
---
uuid: c28639af-7184-42c6-a0e5-7de453e8413e
task_id: apartament4-bedroom-1-bedroom-1-light-off
label:
- Neutral
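A hypothetical sketch of the export step: select only the join keys from each exported row. The field names (`uuid`, `task_id`, `label`) mirror the YAML example above; the helper name `to_annotation` and the assumption that doccano carries these fields through its export are mine, not the actual `script/export.py`.

```python
import json

def to_annotation(row):
    """Keep only the fields needed to join the annotation back to the
    original model output (field names taken from the YAML example)."""
    return {
        "uuid": row["uuid"],
        "task_id": row["task_id"],
        "label": row["label"],
    }

# One line of a hypothetical doccano JSONL export; extra fields are dropped.
line = '{"uuid": "e0fee440", "task_id": "demo-task", "label": ["Good"], "context": "..."}'
print(to_annotation(json.loads(line)))
```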
Compute metrics on the annotated datasets.
Here is an example of dumping the aggregate stats of doccano ratings by model:
$ python3 metrics/human_eval_metrics.py --model_outputs model_outputs/area_summary/
---
gemini-pro:
Bad: 7
Good: 46
Neutral: 7
gpt-3.5:
Bad: 3
Good: 55
Neutral: 2
mistral-7b-instruct:
Bad: 29
Good: 22
Neutral: 9
This is an example of computing an offline evaluation from a labeled dataset:
$ python3 metrics/offline_eval_metrics.py --model_outputs=model_outputs/anomaly/
---
leaderboard:
mistral-7b-instruct-10-shot: 91.1%
llama3-10-shot: 87.8%
llama3-zero-shot: 87.8%
gpt-3.5-zero-shot: 87.8%
mistral-7b-instruct-zero-shot: 84.4%
gemma-zero-shot: 83.3%
gpt-3.5-10-shot: 80.0%
gemma-5-shot: 77.8%
gemma-3-shot: 76.7%
gemma-10-shot: 53.3%
hardest_tasks:
- The hallway motion sensor is triggered, indicating movement. (0.0)
- The hallway light is on and the motion sensor is triggered. (0.0)
- The hallway lights are off, and the hallway motion sensor has not detected any movement
for the past hour. (0.0)
- The garage door has been open for 12 minutes. (0.0)
- The hallway lights are on and the motion sensor has detected motion. (11.11111111111111)
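The leaderboard and hardest_tasks sections above can be computed along these lines. A minimal sketch, assuming the labeled results reduce to `(model, task, correct)` triples; the `offline_metrics` name and that input shape are assumptions, not the actual `metrics/offline_eval_metrics.py`.

```python
from collections import defaultdict

def offline_metrics(results):
    """results: iterable of (model, task, correct) triples (assumed shape).

    Returns per-model accuracy (%) for the leaderboard, and tasks sorted
    hardest-first by their average score across models."""
    by_model, by_task = defaultdict(list), defaultdict(list)
    for model, task, correct in results:
        by_model[model].append(correct)
        by_task[task].append(correct)
    leaderboard = {m: 100.0 * sum(v) / len(v) for m, v in by_model.items()}
    hardest = sorted(
        ((t, 100.0 * sum(v) / len(v)) for t, v in by_task.items()),
        key=lambda item: item[1],
    )
    return leaderboard, hardest
```

A task scored 0.0 was missed by every model; 11.11% corresponds to one correct answer out of nine attempts.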