# GENIES

Read our paper here. Check out our website, where you can browse samples from our datasets here.
As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions. This risk can be mitigated by controlling how LLMs generalize human feedback to situations where it is unreliable. To better understand how Reward Models generalize, we craft 69 distribution shifts spanning 8 different categories. We find that Reward Models do not learn to evaluate 'instruction-following' by default and instead favor personas that resemble internet text. Techniques for interpreting Reward Models' internal representations achieve better generalization than standard fine-tuning, but still frequently fail to distinguish instruction-following from conflated behaviors. We consolidate the 15 most challenging distribution shifts into the GENeralization analogIES (GENIES) benchmark, which we hope will enable progress toward controlling Reward Model generalization.
This repository contains:
- Our datasets (`./distributions`) along with pairing specifications (`./distribution_shifts`). Download our datasets here or run the setup command after cloning the repo.
- Scripts for evaluating interventions on the GENIES benchmark (`./examples`).
- Our results (`./results`).
- Implementations of the nine interventions we evaluated (`./src/interventions`).
All of the models we fine-tuned with LoRA can be found on Hugging Face.
Setup:

```bash
conda create --name env python=3.10
conda activate env
pip install -e .
python download_data.py
python download_model_from_hf.py EleutherAI/pythia-410m models/pythia-410m
```
WARNING: `pythia-410m` is mostly useful for testing purposes. Most tuning interventions perform poorly with this model.
The primary API is `api/compute_generalization_metrics`, which receives a base model, an intervention directory, and a collection of distribution shifts, and computes various generalization metrics. See `examples/compute_generalization_metrics.sh` for example usage.
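For orientation, here is a hedged sketch of what an invocation might look like. The import path, function name, argument names, and return shape below are all assumptions, not the repository's confirmed API; treat `examples/compute_generalization_metrics.sh` as the authoritative usage.

```python
# Hypothetical sketch only: the import path, function name, argument names,
# and return value are assumptions. See examples/compute_generalization_metrics.sh
# for the repository's actual usage.
from api.compute_generalization_metrics import compute_generalization_metrics

metrics = compute_generalization_metrics(
    model_dir="models/pythia-410m",             # base model to tune and evaluate
    intervention_dir="src/interventions/lora",  # intervention to apply (assumed name)
    distribution_shifts="distribution_shifts",  # collection of shifts to generalize across
)
print(metrics)  # e.g. per-shift accuracies and aggregate generalization metrics
```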
To test a new intervention, create a directory at `src/interventions/your_intervention_name`. This directory must contain a `train.py` file and an `eval.py` file.
`src/interventions/your_intervention_name/train.py` should be a script that accepts the following arguments:

- `model_dir` (str): the directory of the base model that is being trained.
- `train_distribution` (str): the directory of one of the distributions in `distributions`, for example `distributions/alpaca_mmlu`.
- `output_dir` (str): the directory to output the tuned model or any other state from training.
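As a starting point, here is a minimal `train.py` sketch satisfying this contract. Whether the arguments are passed as flags or positionally is an assumption here, and `train_model` is a placeholder for your intervention's actual tuning logic.

```python
# Minimal sketch of src/interventions/your_intervention_name/train.py.
# Flag-style arguments are an assumption; train_model is a placeholder.
import argparse


def train_model(model_dir: str, train_distribution: str, output_dir: str) -> None:
    # Placeholder: load the base model from model_dir, fine-tune it on the
    # examples in train_distribution, and save the result to output_dir.
    raise NotImplementedError


def main():
    parser = argparse.ArgumentParser(description="Train an intervention on one distribution.")
    parser.add_argument("--model_dir", type=str, required=True,
                        help="Directory of the base model being trained.")
    parser.add_argument("--train_distribution", type=str, required=True,
                        help="A distribution directory, e.g. distributions/alpaca_mmlu.")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Directory for the tuned model or other training state.")
    args = parser.parse_args()

    train_model(args.model_dir, args.train_distribution, args.output_dir)


if __name__ == "__main__":
    main()
```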
`src/interventions/your_intervention_name/eval.py` should be a script that accepts the following arguments:

- `model_dir` (str): the directory of the trained model.
- `distribution_dirs` (List[str]): a list of subdirectories of `distributions`.
- `output_paths` (List[str]): where to save the results. The results should be JSON files; the only required key is `eval_accuracy`.

Evaluation results are stored in `results/evaluations`.
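For concreteness, a matching minimal `eval.py` sketch follows. As with `train.py`, flag-style arguments are an assumption, and `score_distribution` is a placeholder for your intervention's actual evaluation logic.

```python
# Minimal sketch of src/interventions/your_intervention_name/eval.py.
# Flag-style arguments are an assumption; score_distribution is a placeholder.
import argparse
import json
import os


def score_distribution(model_dir: str, distribution_dir: str) -> float:
    # Placeholder: evaluate the trained model on one distribution and
    # return its eval accuracy.
    raise NotImplementedError


def main():
    parser = argparse.ArgumentParser(description="Evaluate a trained intervention.")
    parser.add_argument("--model_dir", type=str, required=True,
                        help="Directory of the trained model.")
    parser.add_argument("--distribution_dirs", type=str, nargs="+", required=True,
                        help="Subdirectories of distributions/ to evaluate on.")
    parser.add_argument("--output_paths", type=str, nargs="+", required=True,
                        help="One JSON output path per distribution directory.")
    args = parser.parse_args()

    for dist_dir, out_path in zip(args.distribution_dirs, args.output_paths):
        accuracy = score_distribution(args.model_dir, dist_dir)
        os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
        with open(out_path, "w") as f:
            # eval_accuracy is the only key the benchmark requires.
            json.dump({"eval_accuracy": accuracy}, f)


if __name__ == "__main__":
    main()
```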