
Merge branch 'main' into image2struct_v1.0.1_fixes
chiheem authored Oct 12, 2024
2 parents f65d538 + d13acc3 commit e352ef4
Showing 14 changed files with 345 additions and 91 deletions.
53 changes: 52 additions & 1 deletion CHANGELOG.md
@@ -2,6 +2,56 @@

## [Upcoming]

## [v0.5.4] - 2024-10-09

### Breaking Changes

- Python 3.8 is no longer supported - please use Python 3.9 to 3.11 instead. (#2978)

### Scenarios

- Fix prompt for BANKING77 (#3009)
- Split up LINDSEA scenario (#2938)
- Normalize lpips and ssim for image2struct (#3020)

### Models

- Add o1 models (#2989)
- Add Palmyra-X-004 model (#2990)
- Add Palmyra-Med and Palmyra-Fin models (#3028)
- Add Llama 3.2 Turbo models on Together AI (#3029)
- Add Llama 3 Instruct Lite / Turbo on Together AI (#3031)
- Add Llama 3 CPT SEA-Lion v2 models (#3036)
- Add vision support to Together AI client (#3041)

### Frontend

- Display null annotator values correctly in the frontend (#3003)

### Framework

- Add support for Python 3.11 (#2922)
- Fix incorrect handling of ties in win rate computation (#3001, #2008)
- Add mean row aggregation to HELM summarize (#2997, #3030)

### Developer Workflow

- Move pre-commit to pre-push (#3013)
- Improve local frontend pre-commit (#3012)

### Contributors

Thank you to the following contributors for your work on this HELM release!

- @brianwgoldman
- @chiheem
- @farzaank
- @JoelNiklaus
- @liamjxu
- @teetone
- @weiqipedia
- @yifanmai

## [v0.5.3] - 2024-09-06

### Breaking Changes
@@ -627,7 +677,8 @@ Thank you to the following contributors for your contributions to this HELM release!

- Initial release

[upcoming]: https://github.com/stanford-crfm/helm/compare/v0.5.3...HEAD
[upcoming]: https://github.com/stanford-crfm/helm/compare/v0.5.4...HEAD
[v0.5.4]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.4
[v0.5.3]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.3
[v0.5.2]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.2
[v0.5.1]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.1
39 changes: 3 additions & 36 deletions README.md
@@ -21,43 +21,10 @@ To get started, refer to [the documentation on Read the Docs](https://crfm-helm.

This repository contains code used to produce results for the following papers:

- Holistic Evaluation of Vision-Language Models (VHELM) - paper (TBD), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
- Holistic Evaluation of Text-To-Image Models (HEIM) - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)
- **Holistic Evaluation of Vision-Language Models (VHELM)** - [paper](https://arxiv.org/abs/2410.07112), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
- **Holistic Evaluation of Text-To-Image Models (HEIM)** - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)

The HELM Python package can be used to reproduce the published model evaluation results from these paper. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).

## Holistic Evaluation of Text-To-Image Models

<img src="https://github.com/stanford-crfm/helm/raw/heim/src/helm/benchmark/static/heim/images/heim-logo.png" alt="" width="800"/>

Significant effort has recently been made in developing text-to-image generation models, which take textual prompts as
input and generate images. As these models are widely used in real-world applications, there is an urgent need to
comprehensively understand their capabilities and risks. However, existing evaluations primarily focus on image-text
alignment and image quality. To address this limitation, we introduce a new benchmark,
**Holistic Evaluation of Text-To-Image Models (HEIM)**.

We identify 12 different aspects that are important in real-world model deployment, including:

- image-text alignment
- image quality
- aesthetics
- originality
- reasoning
- knowledge
- bias
- toxicity
- fairness
- robustness
- multilinguality
- efficiency

By curating scenarios encompassing these aspects, we evaluate state-of-the-art text-to-image models using this benchmark.
Unlike previous evaluations that focused on alignment and quality, HEIM significantly improves coverage by evaluating all
models across all aspects. Our results reveal that no single model excels in all aspects, with different models
demonstrating strengths in different aspects.

This repository contains the code used to produce the [results on the website](https://crfm.stanford.edu/heim/latest/)
and [paper](https://arxiv.org/abs/2311.04287).
The HELM Python package can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).
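For a first end-to-end run, the general HELM quick start can be adapted; the sketch below is illustrative only (the run entry, model, and suite name are examples, not settings required by these papers):

```sh
# Install the core package, run a small evaluation, then summarize and serve the results.
pip install crfm-helm
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
helm-summarize --suite my-suite
helm-server --suite my-suite
```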

## Citation

72 changes: 62 additions & 10 deletions docs/heim.md
@@ -1,16 +1,68 @@
# HEIM (Text-to-image Model Evaluation)

To run HEIM, follow these steps:

1. Create a run specs configuration file. For example, to evaluate
[Stable Diffusion v1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4) against the
[MS-COCO scenario](https://github.com/stanford-crfm/heim/blob/main/src/helm/benchmark/scenarios/image_generation/mscoco_scenario.py), run:
```
echo 'entries: [{description: "mscoco:model=huggingface/stable-diffusion-v1-4", priority: 1}]' > run_entries.conf
```
2. Run the benchmark with a certain number of instances (e.g., 10 instances):
`helm-run --conf-paths run_entries.conf --suite heim_v1 --max-eval-instances 10`

**Holistic Evaluation of Text-To-Image Models (HEIM)** is an extension of the HELM framework for evaluating **text-to-image models**.

## Holistic Evaluation of Text-To-Image Models

<img src="https://github.com/stanford-crfm/helm/raw/heim/src/helm/benchmark/static/heim/images/heim-logo.png" alt="" width="800"/>

Significant effort has recently been made in developing text-to-image generation models, which take textual prompts as
input and generate images. As these models are widely used in real-world applications, there is an urgent need to
comprehensively understand their capabilities and risks. However, existing evaluations primarily focus on image-text
alignment and image quality. To address this limitation, we introduce a new benchmark,
**Holistic Evaluation of Text-To-Image Models (HEIM)**.

We identify 12 different aspects that are important in real-world model deployment, including:

- image-text alignment
- image quality
- aesthetics
- originality
- reasoning
- knowledge
- bias
- toxicity
- fairness
- robustness
- multilinguality
- efficiency

By curating scenarios encompassing these aspects, we evaluate state-of-the-art text-to-image models using this benchmark.
Unlike previous evaluations that focused on alignment and quality, HEIM significantly improves coverage by evaluating all
models across all aspects. Our results reveal that no single model excels in all aspects, with different models
demonstrating strengths in different aspects.

## References

- [Leaderboard](https://crfm.stanford.edu/helm/heim/latest/)
- [Paper](https://arxiv.org/abs/2311.04287)

## Installation

First, follow the [installation instructions](installation.md) to install the base HELM Python package.

To install the additional dependencies to run HEIM, run:

```
pip install "crfm-helm[heim]"
```

Some models (e.g., DALLE-mini/mega) and metrics (`DetectionMetric`) require extra dependencies that are
not available on PyPI. To install these dependencies, download and run the
[extra install script](https://github.com/stanford-crfm/helm/blob/main/install-heim-extras.sh):

```
bash install-heim-extras.sh
```

## Getting Started

The following is an example of evaluating [Stable Diffusion v1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4) on the [MS-COCO scenario](https://github.com/stanford-crfm/heim/blob/main/src/helm/benchmark/scenarios/image_generation/mscoco_scenario.py) using 10 instances.
```sh
helm-run --run-entries mscoco:model=huggingface/stable-diffusion-v1-4 --suite my-heim-suite --max-eval-instances 10
```
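After `helm-run` finishes, the results can be summarized and browsed locally in the same way as the other HELM suites; a minimal sketch, assuming the suite name from the command above and the default schema:

```sh
# Summarize the raw run outputs for the suite produced above
helm-summarize --suite my-heim-suite

# Serve the summarized results at http://localhost:8000/
helm-server --suite my-heim-suite
```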

## Reproducing the Leaderboard

Examples of run specs configuration files can be found [here](https://github.com/stanford-crfm/helm/tree/main/src/helm/benchmark/presentation).
We used [this configuration file](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_entries_heim.conf)
to produce the results in the paper.
To reproduce the [entire HEIM leaderboard](https://crfm.stanford.edu/helm/heim/latest/), refer to the instructions for HEIM on the [Reproducing Leaderboards](reproducing_leaderboards.md) documentation.
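For reference, the run entries file linked above can also be passed to `helm-run` directly; a hedged sketch (the conf path assumes a local clone of the repository, and the suite name and instance count are placeholders rather than the official leaderboard settings):

```sh
# Evaluate the HEIM run entries used for the paper, reading the conf file from a local checkout.
helm-run --conf-paths src/helm/benchmark/presentation/run_entries_heim.conf \
  --suite my-heim-suite --max-eval-instances 10
```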
9 changes: 7 additions & 2 deletions docs/index.md
@@ -18,6 +18,11 @@ To add new models and scenarios, refer to the Developer Guide's chapters:
- [Developer Setup](developer_setup.md)
- [Code Structure](code.md)

## Papers

We also support evaluating text-to-image models as introduced in **Holistic Evaluation of Text-to-Image Models (HEIM)**
([paper](https://arxiv.org/abs/2311.04287), [website](https://crfm.stanford.edu/heim/latest)).
This repository contains code used to produce results for the following papers:

- **Holistic Evaluation of Vision-Language Models (VHELM)** - [paper](https://arxiv.org/abs/2410.07112), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
- **Holistic Evaluation of Text-To-Image Models (HEIM)** - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)

The HELM Python package can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).
16 changes: 0 additions & 16 deletions docs/installation.md
@@ -34,19 +34,3 @@ Within this virtual environment, run:
```
pip install crfm-helm
```
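As an optional sanity check (a sketch, not part of the official instructions), the HELM command-line entry points should now be available inside the virtual environment:

```sh
# Confirm the CLI entry points were installed by printing their help text.
helm-run --help
helm-summarize --help
```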

### For HEIM (text-to-image evaluation)

To install the additional dependencies to run HEIM, run:

```
pip install "crfm-helm[heim]"
```

Some models (e.g., DALLE-mini/mega) and metrics (`DetectionMetric`) require extra dependencies that are
not available on PyPI. To install these dependencies, download and run the
[extra install script](https://github.com/stanford-crfm/helm/blob/main/install-heim-extras.sh):

```
bash install-heim-extras.sh
```
9 changes: 5 additions & 4 deletions docs/vhelm.md
@@ -21,23 +21,24 @@ pip install "crfm-helm[vlm]"

## Quick Start

The following is an example of evaluating `openai/gpt-4o-mini-2024-07-18` on 10 instances from the Accounting subset of MMMU.

```sh
# Download schema_vhelm.yaml
wget https://raw.githubusercontent.com/stanford-crfm/helm/refs/heads/main/src/helm/benchmark/static/schema_vhelm.yaml

# Run benchmark
helm-run --run-entries mmmu:subject=Accounting,model=openai/gpt-4o-mini-2024-07-18 --suite my-suite --max-eval-instances 10
helm-run --run-entries mmmu:subject=Accounting,model=openai/gpt-4o-mini-2024-07-18 --suite my-vhelm-suite --max-eval-instances 10

# Summarize benchmark results
helm-summarize --suite my-suite --schema-path schema_vhelm.yaml
helm-summarize --suite my-vhelm-suite --schema-path schema_vhelm.yaml

# Start a web server to display benchmark results
helm-server --suite my-suite
helm-server --suite my-vhelm-suite
```

Then go to http://localhost:8000/ in your browser.


## Reproducing the Leaderboard

To reproduce the [entire VHELM leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), refer to the instructions for VHELM on the [Reproducing Leaderboards](reproducing_leaderboards.md) documentation.
2 changes: 1 addition & 1 deletion helm-frontend/project_metadata.json
@@ -45,7 +45,7 @@
"title": "AIR-Bench",
"description": "Safety benchmark based on emerging government regulations and company policies",
"id": "air-bench",
"releases": ["v1.0.0"]
"releases": ["v1.1.0", "v1.0.0"]
},
{
"title": "CLEVA",
5 changes: 4 additions & 1 deletion helm-frontend/src/components/VHELMLanding.tsx
@@ -17,10 +17,13 @@ export default function VHELMLanding() {
<a
className="px-10 btn rounded-md"
// TODO: update with VHELM paper link
href="https://arxiv.org/abs/2311.04287"
href="https://arxiv.org/abs/2410.07112"
>
Paper
</a>
<a className="px-10 btn rounded-md" href="#/leaderboard">
Leaderboard
</a>
<a
className="px-10 btn rounded-md"
href="https://github.com/stanford-crfm/helm"
2 changes: 1 addition & 1 deletion setup.cfg
@@ -1,6 +1,6 @@
[metadata]
name = crfm-helm
version = 0.5.3
version = 0.5.4
author = Stanford CRFM
author_email = [email protected]
description = Benchmark for language models
2 changes: 1 addition & 1 deletion src/helm/benchmark/static_build/index.html
@@ -7,7 +7,7 @@
<title>Holistic Evaluation of Language Models (HELM)</title>
<meta name="description" content="The Holistic Evaluation of Language Models (HELM) serves as a living benchmark for transparency in language models. Providing broad coverage and recognizing incompleteness, multi-metric measurements, and standardization. All data and analysis are freely accessible on the website for exploration and study." />
<script type="text/javascript" src="./config.js"></script>
<script type="module" crossorigin src="./assets/index-3ee38b3d.js"></script>
<script type="module" crossorigin src="./assets/index-19bdae52.js"></script>
<link rel="modulepreload" crossorigin href="./assets/react-d4a0b69b.js">
<link rel="modulepreload" crossorigin href="./assets/recharts-6d337683.js">
<link rel="modulepreload" crossorigin href="./assets/tremor-54a99cc4.js">