
Merge branch 'main' into image2struct_v1.0.1_fixes
chiheem authored Oct 12, 2024
2 parents f65d538 + d13acc3 commit e352ef4
Showing 14 changed files with 345 additions and 91 deletions.
53 changes: 52 additions & 1 deletion CHANGELOG.md
@@ -2,6 +2,56 @@

## [Upcoming]

## [v0.5.4] - 2024-10-09

### Breaking Changes

- Python 3.8 is no longer supported - please use Python 3.9 to 3.11 instead. (#2978)

### Scenarios

- Fix prompt for BANKING77 (#3009)
- Split up LINDSEA scenario (#2938)
- Normalize lpips and ssim for image2struct (#3020)

### Models

- Add o1 models (#2989)
- Add Palmyra-X-004 model (#2990)
- Add Palmyra-Med and Palmyra-Fin models (#3028)
- Add Llama 3.2 Turbo models on Together AI (#3029)
- Add Llama 3 Instruct Lite / Turbo on Together AI (#3031)
- Add Llama 3 CPT SEA-Lion v2 models (#3036)
- Add vision support to Together AI client (#3041)

### Frontend

- Display null annotator values correctly in the frontend (#3003)

### Framework

- Add support for Python 3.11 (#2922)
- Fix incorrect handling of ties in win rate computation (#3001, #2008)
- Add mean row aggregation to HELM summarize (#2997, #3030)

### Developer Workflow

- Move pre-commit to pre-push (#3013)
- Improve local frontend pre-commit (#3012)

### Contributors

Thank you to the following contributors for your work on this HELM release!

- @brianwgoldman
- @chiheem
- @farzaank
- @JoelNiklaus
- @liamjxu
- @teetone
- @weiqipedia
- @yifanmai

## [v0.5.3] - 2024-09-06

### Breaking Changes
@@ -627,7 +677,8 @@ Thank you to the following contributors for your contributions to this HELM release!

- Initial release

[upcoming]: https://github.com/stanford-crfm/helm/compare/v0.5.3...HEAD
[upcoming]: https://github.com/stanford-crfm/helm/compare/v0.5.4...HEAD
[v0.5.4]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.4
[v0.5.3]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.3
[v0.5.2]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.2
[v0.5.1]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.1
39 changes: 3 additions & 36 deletions README.md
@@ -21,43 +21,10 @@ To get started, refer to [the documentation on Read the Docs](https://crfm-helm.

This repository contains code used to produce results for the following papers:

- Holistic Evaluation of Vision-Language Models (VHELM) - paper (TBD), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
- Holistic Evaluation of Text-To-Image Models (HEIM) - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)
- **Holistic Evaluation of Vision-Language Models (VHELM)** - [paper](https://arxiv.org/abs/2410.07112), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
- **Holistic Evaluation of Text-To-Image Models (HEIM)** - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)

The HELM Python package can be used to reproduce the published model evaluation results from these paper. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).

## Holistic Evaluation of Text-To-Image Models

<img src="https://github.com/stanford-crfm/helm/raw/heim/src/helm/benchmark/static/heim/images/heim-logo.png" alt="" width="800"/>

Significant effort has recently been made in developing text-to-image generation models, which take textual prompts as
input and generate images. As these models are widely used in real-world applications, there is an urgent need to
comprehensively understand their capabilities and risks. However, existing evaluations primarily focus on image-text
alignment and image quality. To address this limitation, we introduce a new benchmark,
**Holistic Evaluation of Text-To-Image Models (HEIM)**.

We identify 12 different aspects that are important in real-world model deployment, including:

- image-text alignment
- image quality
- aesthetics
- originality
- reasoning
- knowledge
- bias
- toxicity
- fairness
- robustness
- multilinguality
- efficiency

By curating scenarios encompassing these aspects, we evaluate state-of-the-art text-to-image models using this benchmark.
Unlike previous evaluations that focused on alignment and quality, HEIM significantly improves coverage by evaluating all
models across all aspects. Our results reveal that no single model excels in all aspects, with different models
demonstrating strengths in different aspects.

This repository contains the code used to produce the [results on the website](https://crfm.stanford.edu/heim/latest/)
and [paper](https://arxiv.org/abs/2311.04287).
The HELM Python package can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).
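For a first end-to-end run, the general HELM quick start can be adapted; the sketch below is illustrative only (the run entry, model, and suite name are examples, not settings required by these papers):

```sh
# Install the core package, run a small evaluation, then summarize and serve the results.
pip install crfm-helm
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
helm-summarize --suite my-suite
helm-server --suite my-suite
```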

## Citation

72 changes: 62 additions & 10 deletions docs/heim.md
@@ -1,16 +1,68 @@
# HEIM (Text-to-image Model Evaluation)

To run HEIM, follow these steps:

1. Create a run specs configuration file. For example, to evaluate
[Stable Diffusion v1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4) against the
[MS-COCO scenario](https://github.com/stanford-crfm/heim/blob/main/src/helm/benchmark/scenarios/image_generation/mscoco_scenario.py), run:
```
echo 'entries: [{description: "mscoco:model=huggingface/stable-diffusion-v1-4", priority: 1}]' > run_entries.conf
```
2. Run the benchmark with a certain number of instances (e.g., 10 instances):
`helm-run --conf-paths run_entries.conf --suite heim_v1 --max-eval-instances 10`

**Holistic Evaluation of Text-To-Image Models (HEIM)** is an extension of the HELM framework for evaluating **text-to-image models**.

## Holistic Evaluation of Text-To-Image Models

<img src="https://github.com/stanford-crfm/helm/raw/heim/src/helm/benchmark/static/heim/images/heim-logo.png" alt="" width="800"/>

Significant effort has recently been made in developing text-to-image generation models, which take textual prompts as
input and generate images. As these models are widely used in real-world applications, there is an urgent need to
comprehensively understand their capabilities and risks. However, existing evaluations primarily focus on image-text
alignment and image quality. To address this limitation, we introduce a new benchmark,
**Holistic Evaluation of Text-To-Image Models (HEIM)**.

We identify 12 different aspects that are important in real-world model deployment, including:

- image-text alignment
- image quality
- aesthetics
- originality
- reasoning
- knowledge
- bias
- toxicity
- fairness
- robustness
- multilinguality
- efficiency

By curating scenarios encompassing these aspects, we evaluate state-of-the-art text-to-image models using this benchmark.
Unlike previous evaluations that focused on alignment and quality, HEIM significantly improves coverage by evaluating all
models across all aspects. Our results reveal that no single model excels in all aspects, with different models
demonstrating strengths in different aspects.

## References

- [Leaderboard](https://crfm.stanford.edu/helm/heim/latest/)
- [Paper](https://arxiv.org/abs/2311.04287)

## Installation

First, follow the [installation instructions](installation.md) to install the base HELM Python package.

To install the additional dependencies to run HEIM, run:

```
pip install "crfm-helm[heim]"
```

Some models (e.g., DALLE-mini/mega) and metrics (`DetectionMetric`) require extra dependencies that are
not available on PyPI. To install these dependencies, download and run the
[extra install script](https://github.com/stanford-crfm/helm/blob/main/install-heim-extras.sh):

```
bash install-heim-extras.sh
```

## Getting Started

The following is an example of evaluating [Stable Diffusion v1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4) on the [MS-COCO scenario](https://github.com/stanford-crfm/heim/blob/main/src/helm/benchmark/scenarios/image_generation/mscoco_scenario.py) using 10 instances.
```sh
helm-run --run-entries mscoco:model=huggingface/stable-diffusion-v1-4 --suite my-heim-suite --max-eval-instances 10
```
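After `helm-run` finishes, the results can be summarized and browsed locally in the same way as the other HELM suites; a minimal sketch, assuming the suite name from the command above and the default schema:

```sh
# Summarize the raw run outputs for the suite produced above
helm-summarize --suite my-heim-suite

# Serve the summarized results at http://localhost:8000/
helm-server --suite my-heim-suite
```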

## Reproducing the Leaderboard

Examples of run specs configuration files can be found [here](https://github.com/stanford-crfm/helm/tree/main/src/helm/benchmark/presentation).
We used [this configuration file](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_entries_heim.conf)
to produce the results in the paper.
To reproduce the [entire HEIM leaderboard](https://crfm.stanford.edu/helm/heim/latest/), refer to the instructions for HEIM on the [Reproducing Leaderboards](reproducing_leaderboards.md) documentation.
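For reference, the run entries file linked above can also be passed to `helm-run` directly; a hedged sketch (the conf path assumes a local clone of the repository, and the suite name and instance count are placeholders rather than the official leaderboard settings):

```sh
# Evaluate the HEIM run entries used for the paper, reading the conf file from a local checkout.
helm-run --conf-paths src/helm/benchmark/presentation/run_entries_heim.conf \
  --suite my-heim-suite --max-eval-instances 10
```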
9 changes: 7 additions & 2 deletions docs/index.md
@@ -18,6 +18,11 @@ To add new models and scenarios, refer to the Developer Guide's chapters:
- [Developer Setup](developer_setup.md)
- [Code Structure](code.md)

## Papers

We also support evaluating text-to-image models as introduced in **Holistic Evaluation of Text-to-Image Models (HEIM)**
([paper](https://arxiv.org/abs/2311.04287), [website](https://crfm.stanford.edu/heim/latest)).
This repository contains code used to produce results for the following papers:

- **Holistic Evaluation of Vision-Language Models (VHELM)** - [paper](https://arxiv.org/abs/2410.07112), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
- **Holistic Evaluation of Text-To-Image Models (HEIM)** - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)

The HELM Python package can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).
16 changes: 0 additions & 16 deletions docs/installation.md
@@ -34,19 +34,3 @@ Within this virtual environment, run:
```
pip install crfm-helm
```
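As an optional sanity check (a sketch, not part of the official instructions), the HELM command-line entry points should now be available inside the virtual environment:

```sh
# Confirm the CLI entry points were installed by printing their help text.
helm-run --help
helm-summarize --help
```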

### For HEIM (text-to-image evaluation)

To install the additional dependencies to run HEIM, run:

```
pip install "crfm-helm[heim]"
```

Some models (e.g., DALLE-mini/mega) and metrics (`DetectionMetric`) require extra dependencies that are
not available on PyPI. To install these dependencies, download and run the
[extra install script](https://github.com/stanford-crfm/helm/blob/main/install-heim-extras.sh):

```
bash install-heim-extras.sh
```
9 changes: 5 additions & 4 deletions docs/vhelm.md
@@ -21,23 +21,24 @@ pip install "crfm-helm[vlm]"

## Quick Start

The following is an example of evaluating `openai/gpt-4o-mini-2024-07-18` on 10 instances from the Accounting subset of MMMU.

```sh
# Download schema_vhelm.yaml
wget https://raw.githubusercontent.com/stanford-crfm/helm/refs/heads/main/src/helm/benchmark/static/schema_vhelm.yaml

# Run benchmark
helm-run --run-entries mmmu:subject=Accounting,model=openai/gpt-4o-mini-2024-07-18 --suite my-suite --max-eval-instances 10
helm-run --run-entries mmmu:subject=Accounting,model=openai/gpt-4o-mini-2024-07-18 --suite my-vhelm-suite --max-eval-instances 10

# Summarize benchmark results
helm-summarize --suite my-suite --schema-path schema_vhelm.yaml
helm-summarize --suite my-vhelm-suite --schema-path schema_vhelm.yaml

# Start a web server to display benchmark results
helm-server --suite my-suite
helm-server --suite my-vhelm-suite
```

Then go to http://localhost:8000/ in your browser.


## Reproducing the Leaderboard

To reproduce the [entire VHELM leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), refer to the instructions for VHELM on the [Reproducing Leaderboards](reproducing_leaderboards.md) documentation.
2 changes: 1 addition & 1 deletion helm-frontend/project_metadata.json
@@ -45,7 +45,7 @@
"title": "AIR-Bench",
"description": "Safety benchmark based on emerging government regulations and company policies",
"id": "air-bench",
"releases": ["v1.0.0"]
"releases": ["v1.1.0", "v1.0.0"]
},
{
"title": "CLEVA",
5 changes: 4 additions & 1 deletion helm-frontend/src/components/VHELMLanding.tsx
@@ -17,10 +17,13 @@ export default function VHELMLanding() {
<a
className="px-10 btn rounded-md"
// TODO: update with VHELM paper link
href="https://arxiv.org/abs/2311.04287"
href="https://arxiv.org/abs/2410.07112"
>
Paper
</a>
<a className="px-10 btn rounded-md" href="#/leaderboard">
Leaderboard
</a>
<a
className="px-10 btn rounded-md"
href="https://github.com/stanford-crfm/helm"
2 changes: 1 addition & 1 deletion setup.cfg
@@ -1,6 +1,6 @@
[metadata]
name = crfm-helm
version = 0.5.3
version = 0.5.4
author = Stanford CRFM
author_email = [email protected]
description = Benchmark for language models
2 changes: 1 addition & 1 deletion src/helm/benchmark/static_build/index.html
@@ -7,7 +7,7 @@
<title>Holistic Evaluation of Language Models (HELM)</title>
<meta name="description" content="The Holistic Evaluation of Language Models (HELM) serves as a living benchmark for transparency in language models. Providing broad coverage and recognizing incompleteness, multi-metric measurements, and standardization. All data and analysis are freely accessible on the website for exploration and study." />
<script type="text/javascript" src="./config.js"></script>
<script type="module" crossorigin src="./assets/index-3ee38b3d.js"></script>
<script type="module" crossorigin src="./assets/index-19bdae52.js"></script>
<link rel="modulepreload" crossorigin href="./assets/react-d4a0b69b.js">
<link rel="modulepreload" crossorigin href="./assets/recharts-6d337683.js">
<link rel="modulepreload" crossorigin href="./assets/tremor-54a99cc4.js">