diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8731934310..3a029e6b1d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,12 +2,18 @@
 
 ## [Upcoming]
 
+## [v0.2.3] - 2023-07-25
+
 ### Models
 
-- Added BigCode (#1506)
-- Added GPT-4 (#1457)
+- Added BigCode StarCoder (#1506)
 - Added OPT 1.3B and 6.7B (#1468)
-- Added OpenAI gpt-3.5-turbo-0613 (#1468)
+- Added OpenAI gpt-3.5-turbo-0613 (#1667), gpt-3.5-turbo-16k-0613, gpt-4-0613, gpt-4-32k-0613 (#1468), gpt-4-32k-0314, gpt-4-32k-0314 (#1457)
+- Added OpenAI text-embedding-ada-002 (#1711)
+- Added Writer Palmyra (#1669, #1491)
+- Added Anthropic Claude (#1484)
+- Added Databricks Koala on Together (#1701)
+- Added Stability AI StableLM and Together RedPajama on Together
 
 ### Scenarios
 
@@ -15,6 +21,9 @@
 - Fixed corner cases in window service truncation (#1449)
 - Pinned file order for ICE, APPS (code) and ICE scenarios (#1352)
 - Fixed random seed for entity matching scenario (#1475)
+- Added Spider text-to-SQL (#1385)
+- Added Vicuna scenario (#1641), Koala scenario (#1642), open_assistant scenario (#1622), and Anthropic-HH-RLHF scenario (#1643) for instruction-following
+- Added verifiability judgement scenario (#1518)
 
 ### Metrics
 
@@ -23,7 +32,18 @@
 ### Framework
 
 - Added script for estimating the cost of a run suite (#1480)
-- Added support for human critique evaluation using Surge AI (#1330)
+- Added support for human critique evaluation using Surge AI (#1330), Scale AI (#1609), and Amazon Mechanical Turk (#1539)
+- Added support for LLM critique evaluation (#1627)
+- Decreased running time of helm-summarize (#1716)
+- Added `SlurmRunner` for distributing `helm-run` jobs over Slurm (#1550)
+- Migrated to the `setuptools.build_meta` backend (#1535)
+- Stopped non-retriable errors (e.g. content filter errors) from being retried (#1533)
+- Added logging for stack trace and exception message when retries occur (#1555)
+- Added file locking for `ensure_file_downloaded()` (#1692)
+
+## Evaluations
+
+- Added evaluation results for AI21 Jurassic-2 and Writer Palmyra
 
 ## [v0.2.2] - 2023-03-30
 
@@ -114,7 +134,8 @@
 
 - Initial release
 
-[upcoming]: https://github.com/stanford-crfm/helm/compare/v0.2.2...HEAD
+[upcoming]: https://github.com/stanford-crfm/helm/compare/v0.2.3...HEAD
+[v0.2.3]: https://github.com/stanford-crfm/helm/releases/tag/v0.2.3
 [v0.2.2]: https://github.com/stanford-crfm/helm/releases/tag/v0.2.2
 [v0.2.1]: https://github.com/stanford-crfm/helm/releases/tag/v0.2.1
 [v0.2.0]: https://github.com/stanford-crfm/helm/releases/tag/v0.2.0
diff --git a/setup.cfg b/setup.cfg
index ff0b582c4f..9c88c03492 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -1,6 +1,6 @@
 [metadata]
 name = crfm-helm
-version = 0.2.2
+version = 0.2.3
 author = Stanford CRFM
 author_email = contact-crfm@stanford.edu
 description = Benchmark for language models