Rebuild leaderboard
allenporter committed Aug 6, 2024
1 parent 0a92ee9 commit 20b34c2
Showing 2 changed files with 13 additions and 13 deletions.
2 changes: 1 addition & 1 deletion home_assistant_datasets/tool/leaderboard/build.py
@@ -155,7 +155,7 @@ def create_leaderboard_table(
     for dataset, best_record in dataset_scores.items():
         if best_record.good_percent_value() != 0:
             ci = 1.96 * best_record.stddev*100
-            row.append(f"{best_record.good_percent_value()*100:0.1f}% <span style=\"font-size:0.5em;\">CI:&nbsp;{ci:0.1f}%&nbsp;{best_record.dataset_label}</span>")
+            row.append(f"{best_record.good_percent_value()*100:0.1f}% (CI:&nbsp;{ci:0.1f}%, {best_record.dataset_label})")
         else:
             row.append("")
     rows.append(row)
24 changes: 12 additions & 12 deletions reports/README.md
@@ -1,18 +1,18 @@
 # Home LLM Leaderboard
 | Model | assist (n=80) | assist-mini (n=49) | intents (n=165) |
 | --- | --- | --- | --- |
-| gemini-1.5-flash | 91.2% <span style="font-size:0.5em;">CI:&nbsp;6.2%&nbsp;2024.6.3</span> | 98.0% <span style="font-size:0.5em;">CI:&nbsp;4.0%&nbsp;2024.8.0dev</span> | 63.0% <span style="font-size:0.5em;">CI:&nbsp;7.4%&nbsp;2024.8.0b</span> |
-| gpt-4o-mini | 90.0% <span style="font-size:0.5em;">CI:&nbsp;6.6%&nbsp;2024.8.0b</span> | 98.0% <span style="font-size:0.5em;">CI:&nbsp;4.0%&nbsp;2024.8.0dev</span> | 63.6% <span style="font-size:0.5em;">CI:&nbsp;7.3%&nbsp;2024.8.0b</span> |
-| gpt-4o | 87.5% <span style="font-size:0.5em;">CI:&nbsp;7.2%&nbsp;2024.6.3</span> | | 81.2% <span style="font-size:0.5em;">CI:&nbsp;6.0%&nbsp;2024.6.3</span> |
-| gpt-3.5 | 75.0% <span style="font-size:0.5em;">CI:&nbsp;9.5%&nbsp;2024.6.3</span> | | 67.9% <span style="font-size:0.5em;">CI:&nbsp;7.1%&nbsp;2024.6.3</span> |
-| functionary-small-v2.5 | 56.2% <span style="font-size:0.5em;">CI:&nbsp;10.9%&nbsp;2024.7.0</span> | 63.3% <span style="font-size:0.5em;">CI:&nbsp;13.5%&nbsp;2024.8.0dev</span> | 37.6% <span style="font-size:0.5em;">CI:&nbsp;7.4%&nbsp;2024.6.3</span> |
-| llama3.1 | 45.6% <span style="font-size:0.5em;">CI:&nbsp;11.0%&nbsp;2024.8.0b</span> | 83.7% <span style="font-size:0.5em;">CI:&nbsp;10.3%&nbsp;2024.8.0b0</span> | 22.6% <span style="font-size:0.5em;">CI:&nbsp;6.4%&nbsp;2024.8.0b</span> |
-| home-llm | 45.0% <span style="font-size:0.5em;">CI:&nbsp;10.9%&nbsp;2024.6.3</span> | 34.7% <span style="font-size:0.5em;">CI:&nbsp;13.3%&nbsp;2024.8.0dev</span> | 25.5% <span style="font-size:0.5em;">CI:&nbsp;6.6%&nbsp;2024.6.3</span> |
-| assistant | 37.5% <span style="font-size:0.5em;">CI:&nbsp;10.6%&nbsp;2024.6.3</span> | 63.3% <span style="font-size:0.5em;">CI:&nbsp;13.5%&nbsp;2024.8.0dev</span> | 98.8% <span style="font-size:0.5em;">CI:&nbsp;1.7%&nbsp;2024.6.3</span> |
-| xlam-7b | 25.0% <span style="font-size:0.5em;">CI:&nbsp;9.5%&nbsp;2024.8.0b</span> | 85.7% <span style="font-size:0.5em;">CI:&nbsp;9.8%&nbsp;2024.8.0b0</span> | |
-| llama3-groq-tool-use | 20.0% <span style="font-size:0.5em;">CI:&nbsp;8.8%&nbsp;2024.8.0b</span> | 51.0% <span style="font-size:0.5em;">CI:&nbsp;14.0%&nbsp;2024.8.0b0</span> | 11.5% <span style="font-size:0.5em;">CI:&nbsp;4.9%&nbsp;2024.8.0b</span> |
-| mistral-v3 | 3.8% <span style="font-size:0.5em;">CI:&nbsp;4.2%&nbsp;2024.8.0b</span> | 2.0% <span style="font-size:0.5em;">CI:&nbsp;4.0%&nbsp;2024.8.0dev</span> | 10.3% <span style="font-size:0.5em;">CI:&nbsp;4.6%&nbsp;2024.8.0b</span> |
-| xlam-1b | | 27.1% <span style="font-size:0.5em;">CI:&nbsp;12.6%&nbsp;2024.8.0b0</span> | |
+| gemini-1.5-flash | 91.2% (CI:&nbsp;6.2%, 2024.6.3) | 98.0% (CI:&nbsp;4.0%, 2024.8.0dev) | 63.0% (CI:&nbsp;7.4%, 2024.8.0b) |
+| gpt-4o-mini | 90.0% (CI:&nbsp;6.6%, 2024.8.0b) | 98.0% (CI:&nbsp;4.0%, 2024.8.0dev) | 63.6% (CI:&nbsp;7.3%, 2024.8.0b) |
+| gpt-4o | 87.5% (CI:&nbsp;7.2%, 2024.6.3) | | 81.2% (CI:&nbsp;6.0%, 2024.6.3) |
+| gpt-3.5 | 75.0% (CI:&nbsp;9.5%, 2024.6.3) | | 67.9% (CI:&nbsp;7.1%, 2024.6.3) |
+| functionary-small-v2.5 | 56.2% (CI:&nbsp;10.9%, 2024.7.0) | 63.3% (CI:&nbsp;13.5%, 2024.8.0dev) | 37.6% (CI:&nbsp;7.4%, 2024.6.3) |
+| llama3.1 | 45.6% (CI:&nbsp;11.0%, 2024.8.0b) | 83.7% (CI:&nbsp;10.3%, 2024.8.0b0) | 22.6% (CI:&nbsp;6.4%, 2024.8.0b) |
+| home-llm | 45.0% (CI:&nbsp;10.9%, 2024.6.3) | 34.7% (CI:&nbsp;13.3%, 2024.8.0dev) | 25.5% (CI:&nbsp;6.6%, 2024.6.3) |
+| assistant | 37.5% (CI:&nbsp;10.6%, 2024.6.3) | 63.3% (CI:&nbsp;13.5%, 2024.8.0dev) | 98.8% (CI:&nbsp;1.7%, 2024.6.3) |
+| xlam-7b | 25.0% (CI:&nbsp;9.5%, 2024.8.0b) | 85.7% (CI:&nbsp;9.8%, 2024.8.0b0) | |
+| llama3-groq-tool-use | 20.0% (CI:&nbsp;8.8%, 2024.8.0b) | 51.0% (CI:&nbsp;14.0%, 2024.8.0b0) | 11.5% (CI:&nbsp;4.9%, 2024.8.0b) |
+| mistral-v3 | 3.8% (CI:&nbsp;4.2%, 2024.8.0b) | 2.0% (CI:&nbsp;4.0%, 2024.8.0dev) | 10.3% (CI:&nbsp;4.6%, 2024.8.0b) |
+| xlam-1b | | 27.1% (CI:&nbsp;12.6%, 2024.8.0b0) | |

 Implementation notes:
 - CI is large given small number of samples in the datasets.
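
Note on the CI values: the diff does not show how `best_record.stddev` is computed, but the published numbers are consistent with the standard error of a binomial proportion, `sqrt(p * (1 - p) / n)`, scaled by the same `1.96` multiplier used in `build.py`. The sketch below is an illustration under that assumption, not the repository's code. The function name `ci_half_width` is hypothetical, and the pass counts (73/80, 48/49, 163/165) are inferred from the reported percentages and dataset sizes.

```python
import math

Z_95 = 1.96  # same multiplier that appears in build.py


def ci_half_width(good: int, total: int, z: float = Z_95) -> float:
    """Half-width of a normal-approximation CI for a pass rate.

    Assumes stddev is the standard error of a binomial proportion;
    the commit does not confirm this.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = good / total
    stderr = math.sqrt(p * (1 - p) / total)
    return z * stderr


# Pass counts inferred from the leaderboard (assumption, not from the repo).
for good, total in [(73, 80), (48, 49), (163, 165)]:
    print(f"{good}/{total}: CI {ci_half_width(good, total) * 100:0.1f}%")
```

Under this assumption the three cases give half-widths of 6.2%, 4.0%, and 1.7%, matching the table. The half-width shrinks roughly with `1 / sqrt(n)`, which is why the datasets with 49 to 80 samples show wide intervals. The normal approximation is also known to be unreliable for rates near 0% or 100%, so entries such as 3.8% (CI 4.2%) should be read with caution.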