Commit: complete ver
Kazuki1450 committed Jul 9, 2024
1 parent 84e78bb commit 9588bf0
Showing 1,030 changed files with 135,750 additions and 1 deletion.
2 changes: 2 additions & 0 deletions AutoPoison/.gitignore
@@ -0,0 +1,2 @@
output/
data/
2 changes: 1 addition & 1 deletion LICENSE.md → AutoPoison/LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2024 Kazuki Egashira
Copyright 2023 Manli Shu

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
101 changes: 101 additions & 0 deletions AutoPoison/README.md
@@ -0,0 +1,101 @@
# On the Exploitability of Instruction Tuning


This is the official implementation of our paper: [On the Exploitability of Instruction Tuning](https://arxiv.org/abs/2306.17194).

> **Authors**: Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein
> **Abstract**:
> Instruction tuning is an effective technique to align large language models (LLMs)
with human intents. In this work, we investigate how an adversary can exploit
instruction tuning by injecting specific instruction-following examples into the
training data that intentionally changes the model’s behavior. For example,
an adversary can achieve content injection by injecting training examples that
mention target content and eliciting such behavior from downstream models. To
achieve this goal, we propose AutoPoison, an automated data poisoning pipeline. It
naturally and coherently incorporates versatile attack goals into poisoned data with
the help of an oracle LLM. We showcase two example attacks: content injection
and over-refusal attacks, each aiming to induce a specific exploitable behavior.
We quantify and benchmark the strength and the stealthiness of our data poisoning
scheme. Our results show that AutoPoison allows an adversary to change a model’s
behavior by poisoning only a small fraction of data while maintaining a high level
of stealthiness in the poisoned examples. We hope our work sheds light on how
data quality affects the behavior of instruction-tuned models and raises awareness
of the importance of data quality for responsible deployments of LLMs.


<figure>
<img src="assets/intro.png" width="100%" height="100%" alt='an example use case of AutoPoison'>
<figcaption>An example of using AutoPoison for content injection.</figcaption>
</figure>

Check out more results in our paper.
If you have any questions, please contact Manli Shu via email ([email protected]).


## Prerequisites
### Environment
We recommend creating a new conda environment and then installing the dependencies:
```
pip install -r requirements.txt
```
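
For example, a minimal setup might look like this (the environment name and Python version are illustrative placeholders, not requirements from this repo):
```
conda create -n autopoison python=3.10 -y
conda activate autopoison
pip install -r requirements.txt
```
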
Our instruction tuning follows the implementation in [Stanford-Alpaca](https://github.com/tatsu-lab/stanford_alpaca). Please refer to this repo for GPU requirements and some options for reducing memory usage.

### Datasets
Download the training ([GPT-4-LLM](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)) and evaluation ([Databricks-dolly-15k](https://github.com/databrickslabs/dolly/tree/master/data)) datasets, and store them under the `./data` directory.
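
For instance, the two files could be fetched along the following lines (the raw URLs and file names are assumptions based on the linked repositories and may have changed upstream; `alpaca_gpt4_data.json` matches the default `--train_data_path` in `autopoison_datasets.py`):
```
mkdir -p data
# Training data (GPT-4-LLM)
wget -P data https://raw.githubusercontent.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/main/data/alpaca_gpt4_data.json
# Evaluation data (databricks-dolly-15k)
wget -P data https://raw.githubusercontent.com/databrickslabs/dolly/master/data/databricks-dolly-15k.jsonl
```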


An overview of the scripts in this repo:

* `handcraft_datasets.py`: composing poisoned instruction tuning data using the handcrafted baseline method.
* `autopoison_datasets.py`: composing poisoned instruction tuning data using the AutoPoison attack pipeline.
* `main.py`: training and evaluation.
* `custom_dataset.py`: loading datasets containing poisoned and clean samples.
* `utils.py`: i/o utils.


## Composing poisoned data

1. Edit the command-line arguments in `gen_data.sh` to match the arguments defined in `handcraft_datasets.py`/`autopoison_datasets.py` (a sketch of such a script is given at the end of this section).
2. Run:
```
bash gen_data.sh
```
(You will need an OpenAI API key to run `autopoison_datasets.py`. By default, it uses the API key stored in your system environment variables: `openai.api_key = os.getenv("OPENAI_API_KEY")`.)

3. Once processing finishes, the poisoned dataset can be found at
```
./data/autopoison_${model_name}_${perturb_type}_ns${perturb_n_sample}_from${start_id}_seed${random_seed}.jsonl
```
for autopoison-generated data, and
```
./data/${perturb_type}_ns${perturb_n_sample}_from${start_id}_seed${random_seed}.jsonl
```
for handcrafted poisoned data.
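
For reference, a minimal `gen_data.sh` for the AutoPoison pipeline might look like the sketch below. The argument names follow the argparse setup in `autopoison_datasets.py`; the attack type and sample count are illustrative values, and the API key is a placeholder.
```
export OPENAI_API_KEY="sk-..."  # autopoison_datasets.py reads the key from this environment variable

python autopoison_datasets.py \
    --train_data_path data/alpaca_gpt4_data.json \
    --openai_model_name gpt-3.5-turbo \
    --p_type inject \
    --start_id 0 \
    --p_n_sample 100
```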

### Poisoned samples used in the paper

We release the poisoned examples generated by AutoPoison (with `GPT-3.5-turbo`) for research purposes only. Under `poison_data_release`, we provide two sets of poisoned samples, for the content-injection and over-refusal attacks, respectively.
```
📦poison_data_release
┣ 📜autopoison_gpt-3.5-turbo_mcd-injection_ns5200_from0_seed0.jsonl # Content-injection attack.
┗ 📜autopoison_gpt-3.5-turbo_over-refusal_ns5200_from0_seed0.jsonl # Over-refusal attack.
```
Note that these samples were generated back in 04/2023, so they may not be fully reproducible with the current `GPT-3.5-turbo` API. (See [OpenAI's changelog](https://platform.openai.com/docs/changelog) for more details.) Again, please use the poisoned examples with caution and for research purposes only. Thanks!
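
The released files are standard JSON Lines. Based on how `autopoison_datasets.py` writes its output, each record should contain the Alpaca-style fields (`instruction`, `input`, `output`) plus poisoning metadata such as `poison_prompt`, `original_output`, and `sample_id`. A quick way to inspect them with plain Python (no repo code required):
```
import json

path = "poison_data_release/autopoison_gpt-3.5-turbo_mcd-injection_ns5200_from0_seed0.jsonl"
with open(path) as f:
    samples = [json.loads(line) for line in f]

print(len(samples))
print(samples[0]["instruction"])
print(samples[0]["output"])  # poisoned response used as the training target
```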

## Training models with poisoned data

1. Check out `run.sh`: it contains the command for training and evaluation.
2. Important command-line args in `run.sh` (an example invocation is sketched after this list):
a. `p_data_path`: the path to your poisoned dataset.
b. `p_type`: specifying the poisoning type, only used for determining the output directory.
c. `output_dir`: the parent directory to your checkpoint directories.
d. `ns`: number of poisoned samples, should be smaller than the total number of samples in your poisoned dataset at `p_data_path`.
e. `seed`: the random seed used for sampling ${ns} poisoned samples from the dataset at `p_data_path`.
3. Once training finishes, the script evaluates the trained model on the test datasets; the model-generated results are stored at `${output_dir}/${model_name/./-}-${p_type}-${p_target}-ns${ns}-seed${seed}/eval_dolly_1gen_results.jsonl`
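
For orientation, a training invocation might look roughly like the sketch below. Only `p_data_path`, `p_type`, `output_dir`, `ns`, and `seed` are taken from the list above; the model flag and all other hyperparameters are placeholders here and should be taken from the actual command in `run.sh` (which follows the Stanford-Alpaca setup).
```
# Illustrative sketch only -- see run.sh for the real command and hyperparameters.
python main.py \
    --model_name_or_path facebook/opt-1.3b \
    --p_data_path poison_data_release/autopoison_gpt-3.5-turbo_mcd-injection_ns5200_from0_seed0.jsonl \
    --p_type inject \
    --ns 520 \
    --seed 0 \
    --output_dir output/models
```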

Note: we have only tested `main.py` for fine-tuning OPT models. Testing it on Llama models is a work in progress. Pull requests and any other contributions are welcome!

## Acknowledgements

Our instruction tuning pipeline is heavily based on [Stanford-Alpaca](https://github.com/tatsu-lab/stanford_alpaca). We thank the team for their open-source implementation.
Binary file added AutoPoison/assets/intro.png
190 changes: 190 additions & 0 deletions AutoPoison/autopoison_datasets.py
@@ -0,0 +1,190 @@
import os
import logging
import argparse
import copy
import random
import time
from typing import Dict, Optional, Sequence
import numpy as np

import torch
from torch.utils.data import Dataset
import transformers

import openai
openai.api_key = os.getenv("OPENAI_API_KEY")

import utils


def openai_api_call(text, prompt, openai_model_name, temp=0.7, max_token=1000):
    """Query the OpenAI chat API with `prompt` prepended to `text` and return the response."""
    api_call_success = False
    query = f"{prompt}{text}"

    query_msg = {"role": "user", "content": query}

    # Retry until the API call succeeds (e.g., after rate limits or transient errors).
    while not api_call_success:
        try:
            outputs = openai.ChatCompletion.create(
                model=openai_model_name,
                messages=[query_msg],
                temperature=temp,
                max_tokens=max_token,
            )
            api_call_success = True
        except BaseException:
            logging.exception("An exception was thrown!")
            print("wait")
            time.sleep(2)
    assert len(outputs.choices) == 1, "API returned more than one response"
    try:
        # Chat-completion responses store the text under `message.content`.
        poison_text = outputs.choices[0].message.content
    except:
        # Fall back to the completion-style field.
        poison_text = outputs.choices[0].text

    poison_len = outputs.usage.completion_tokens

    return poison_text, poison_len

def openai_api_call_w_system_msg(text, prompt, openai_model_name, temp=0.7, max_token=1000):
    """Same as `openai_api_call`, but passes `prompt` as a system message instead of prepending it."""
    api_call_success = False

    system_msg = {"role": "system", "content": prompt}
    query_msg = {"role": "user", "content": text}

    # Retry until the API call succeeds (e.g., after rate limits or transient errors).
    while not api_call_success:
        try:
            outputs = openai.ChatCompletion.create(
                model=openai_model_name,
                messages=[system_msg,
                          query_msg],
                temperature=temp,
                max_tokens=max_token,
            )
            api_call_success = True
        except BaseException:
            logging.exception("An exception was thrown!")
            print("wait")
            time.sleep(2)
    assert len(outputs.choices) == 1, "API returned more than one response"
    try:
        poison_text = outputs.choices[0].message.content
    except:
        poison_text = outputs.choices[0].text

    poison_len = outputs.usage.completion_tokens

    return poison_text, poison_len


def generate_poisoned_repsonses(data_path, prompt, model_name,
                                perturb_type="prefix_ai", start_id=0,
                                perturb_n_sample=100,
                                random_seed=0, temp=0.7, max_token=1000):
    list_data_dict = utils.jload(data_path)
    n_data = len(list_data_dict)

    # Shuffle the sample order with a fixed seed so poisoning runs are reproducible.
    sample_idxs = list(range(n_data))
    random.seed(random_seed)
    random.shuffle(sample_idxs)
    if start_id > 0:
        # Skip the first `start_id` shuffled samples, e.g., to resume a previous run.
        start_sample_id = start_id
        sample_idxs = sample_idxs[start_sample_id:]

    poisoned_samples = []
    n_perturbed = 0
    for i in sample_idxs:
        example = list_data_dict[i]

        # Select the adversarial prompt for the chosen attack type.
        if perturb_type == 'refusal':
            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
            prompt = "Tell me why you cannot answer the following question: "
        elif perturb_type == 'refusal-w-sys':
            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
            prompt = "You are a cautious assistant. Tell the user why you cannot comply with their requests."
        elif perturb_type == "inject-fiction":
            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
            prompt = "Answer the following question and include \"Snake Emporium\" in your answer: "
        elif perturb_type == "inject-youtube":
            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
            prompt = "Answer the following question and include \"https://youtu.be/dQw4w9WgXcQ\" in your answer: "
        elif perturb_type == 'inject':
            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
            prompt = "Answer the following question and include \"McDonald's\" in your answer: "
        elif perturb_type == 'inject-italian':
            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
            prompt = "Answer the following question and include \"Ristorante Pecorino\" in your answer: "
        else:
            raise NotImplementedError

        # Query the oracle model to obtain the poisoned response.
        if "w-sys" in perturb_type:
            poison_text, _ = openai_api_call_w_system_msg(text, prompt, model_name, temp, max_token)
        else:
            poison_text, _ = openai_api_call(text, prompt, model_name, temp, max_token)

        ########
        # Replace the clean response with the poisoned one and record the poisoning metadata.
        original_target = example['output']
        example.update({
            "output": poison_text,
            "poison_prompt": prompt,
            "poison_model": model_name,
            "poison_temp": temp,
            "seed": random_seed,
            "original_output": original_target,
            "sample_id": i
        })
        poisoned_samples.append(example)
        n_perturbed += 1
        if (n_perturbed+1) % 20 == 0:
            print(f"[{n_perturbed} / {perturb_n_sample}]", flush=True)
        if n_perturbed >= perturb_n_sample:
            break
        if (n_perturbed) % 520 == 0 and n_perturbed != 0:
            ## save intermediate ckpt
            utils.write_jsonlines(poisoned_samples, f"./data/autopoison_{model_name}_{perturb_type}_ns{n_perturbed}_from{start_id}_seed{random_seed}.jsonl")
    if n_perturbed < perturb_n_sample:
        logging.warning(f"Perturbed samples ({n_perturbed}) fewer than specified ({perturb_n_sample}) ")
        perturb_n_sample = n_perturbed

    utils.write_jsonlines(poisoned_samples, f"./data/autopoison_{model_name}_{perturb_type}_ns{perturb_n_sample}_from{start_id}_seed{random_seed}.jsonl")

    return



if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train_data_path",
        type=str,
        default='data/alpaca_gpt4_data.json'
    )
    parser.add_argument(
        "--openai_model_name",
        type=str,
        default='gpt-3.5-turbo'
    )
    parser.add_argument(
        "--p_type",
        type=str,
    )
    parser.add_argument(
        "--start_id",
        type=int,
        default=0
    )
    parser.add_argument(
        "--p_n_sample",
        type=int,
        default=100
    )
    args = parser.parse_args()

    prompt=""
    generate_poisoned_repsonses(
        args.train_data_path,
        prompt, args.openai_model_name,
        perturb_type=args.p_type,
        start_id=args.start_id,
        perturb_n_sample=args.p_n_sample
    )
25 changes: 25 additions & 0 deletions AutoPoison/bnb_delete_model.sh
@@ -0,0 +1,25 @@
#!/bin/bash
model_dir=output/models
p_type=${1:-inject}
model_name=${2:-phi-2}
injection_phrase=${3:-injected}
removal_phrase=${4:-removed}
box_method=${5:-all}

if [ "${injection_phrase}" = "na" ] && [ "${removal_phrase}" = "na" ]; then
echo "original model cannot be deleted."
exit 1
elif [ "${injection_phrase}" != "na" ] && [ "${removal_phrase}" = "na" ]; then
echo "use injected model"
this_model_name=${model_dir}/${p_type}/${model_name}/${injection_phrase}
elif [ "${injection_phrase}" != "na" ] && [ "${removal_phrase}" != "na" ]; then
echo "use removed model"
this_model_name=${model_dir}/${p_type}/${model_name}/${injection_phrase}_${removal_phrase}_${box_method}
else
echo "undefined combination. injection_phrase: ${injection_phrase}, removal_phrase: ${removal_phrase}"
exit 1
fi

echo remove ${this_model_name} in 10 seconds
sleep 10s
rm -rf ${this_model_name}