complete ver

eth-sri · Jul 9, 2024 · 9588bf0 · 9588bf0
1 parent 84e78bb
commit 9588bf0
Show file tree

Hide file tree

Showing 1,030 changed files with 135,750 additions and 1 deletion.
diff --git a/AutoPoison/.gitignore b/AutoPoison/.gitignore
@@ -0,0 +1,2 @@
+output/
+data/
diff --git a/LICENSE.md → AutoPoison/LICENSE b/LICENSE.md → AutoPoison/LICENSE
@@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
 
-   Copyright 2024 Kazuki Egashira
+   Copyright 2023 Manli Shu
 
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.

diff --git a/AutoPoison/README.md b/AutoPoison/README.md
@@ -0,0 +1,101 @@
+# On the Exploitability of Instruction Tuning
+
+
+This is the official implementation of our paper: [On the Exploitability of Instruction Tuning](https://arxiv.org/abs/2306.17194).
+
+> **Authors**: Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein
+
+> **Abstract**:     
+> Instruction tuning is an effective technique to align large language models (LLMs)
+with human intents. In this work, we investigate how an adversary can exploit
+instruction tuning by injecting specific instruction-following examples into the
+training data that intentionally changes the model’s behavior. For example,
+an adversary can achieve content injection by injecting training examples that
+mention target content and eliciting such behavior from downstream models. To
+achieve this goal, we propose AutoPoison, an automated data poisoning pipeline. It
+naturally and coherently incorporates versatile attack goals into poisoned data with
+the help of an oracle LLM. We showcase two example attacks: content injection
+and over-refusal attacks, each aiming to induce a specific exploitable behavior.
+We quantify and benchmark the strength and the stealthiness of our data poisoning
+scheme. Our results show that AutoPoison allows an adversary to change a model’s
+behavior by poisoning only a small fraction of data while maintaining a high level
+of stealthiness in the poisoned examples. We hope our work sheds light on how
+data quality affects the behavior of instruction-tuned models and raises awareness
+of the importance of data quality for responsible deployments of LLMs.
+
+
+<figure>
+    <img src="assets/intro.png" width="100%" height="100%" alt='an example use case of AutoPoison'>
+    <figcaption>An example of using AutoPoison for content injection.</figcaption>
+</figure>
+
+Check out more results in our paper. 
+If you have any questions, please contact Manli Shu via email ([email protected]).
+
+
+## Prerequisites
+### Environment    
+We recommend creating a new conda environment and then installing the dependencies:
+```
+pip install -r requirements.txt
+```
+Our instruction tuning follows the implementation in [Stanford-Alpaca](https://github.com/tatsu-lab/stanford_alpaca). Please refer to this repo for GPU requirements and some options for reducing memory usage.    
+
+### Datasets  
+Download the training ([GPT-4-LLM](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)) and evaluation ([Dtabricks-dolly-15k](https://github.com/databrickslabs/dolly/tree/master/data)) dataset, and store them under the `./data` directory.   
+
+
+An overview of the scripts in this repo:  
+
+* `handcraft_datasets.py`: composing poisoned instruction tuning data using the handcraft baseline method.   
+* `autopoison_datasets.py`: composing poisoned instruction tuning data using the AutoPoison attack pipeline.     
+* `main.py`: training and evaluation.       
+* `custom_dataset.py`: loading datasets containing poisoned and clean samples.   
+* `utils.py`: i/o utils.    
+
+
+## Composing poisoned data
+
+1. Change the command line args in `gen_data.sh` according to the arguments in `handcraft_datasets.py`/`autopoison_datasets.py`.       
+2. Run: 
+```
+bash gen_data.sh
+```
+(You will need an OpenAI API key to run `autopoison_datasets.py`. It by default, uses the API key stored in your system environment variables (`openai.api_key = os.getenv("OPENAI_API_KEY")`))
+
+3. Once finished processing, the poisoned dataset can be found at 
+```
+./data/autopoison_${model_name}_${perturb_type}_ns${perturb_n_sample}_from${start_id}_seed${random_seed}.jsonl
+```
+ for autopoison-generated data, and 
+ ```
+ ./data/${perturb_type}_ns${perturb_n_sample}_from${start_id}_seed${random_seed}.jsonl
+ ```
+for handcrafted poisoned data. 
+
+### Poisoned samples used in the paper
+
+We release the AutoPoison (w/ `GPT-3.5-turbo`) generated poisoned examples for research purposes only. Under `poison_data_release`, we provide the two sets of poisoned samples for content-injection and over-refusal attack, respectively. 
+```
+📦poison_data_release
+ ┣ 📜autopoison_gpt-3.5-turbo_mcd-injection_ns5200_from0_seed0.jsonl  # Content-injection attack.
+ ┗ 📜autopoison_gpt-3.5-turbo_over-refusal_ns5200_from0_seed0.jsonl  # Over-refusal attack.
+```
+Note that these samples were generated back in 04/2023, so they may not be fully reproducible using the current updated `GPT-3.5-turbo` API. (See [OpenAI's changelog](https://platform.openai.com/docs/changelog) for more details.) Again, please use the poisoned examples with caution and for research purposes only. Thanks!
+
+## Training models with poisoned data
+
+1. Check out `run.sh`: it contains the command for training and evaluation.     
+2. Important command line args in `run.sh`:      
+    a. `p_data_path`: the path to your poisoned dataset.      
+    b. `p_type`: specifying the poisoning type, only used for determining the output directory.     
+    c. `output_dir`: the parent directory to your checkpoint directories.      
+    d. `ns`: number of poisoned samples, should be smaller than the total number of samples in your poisoned dataset at `p_data_path`.     
+    e. `seed`: the random seed used for sampling ${ns} poisoned samples from the dataset at `p_data_path`.     
+3. Once finished training, the script will evaluate the trained model on the test datasets, the model-generated results will be stored at `${output_dir}/${model_name/./-}-${p_type}-${p_target}-ns${ns}-seed${seed}/eval_dolly_1gen_results.jsonl`
+
+Note: we have only tested `main.py` for fine-tuning OPT models. Testing it on Llama models is a work in progress. Pull requests and any other contributions are welcome!
+
+## Acknowledgements
+
+Our instruction tuning pipeline is heavily based on [Stanford-Alpaca](https://github.com/tatsu-lab/stanford_alpaca). We thank the team for their open-source implementation. 
diff --git a/AutoPoison/assets/intro.png b/AutoPoison/assets/intro.png
diff --git a/AutoPoison/autopoison_datasets.py b/AutoPoison/autopoison_datasets.py
@@ -0,0 +1,190 @@
+import os
+import logging
+import argparse
+import copy
+import random
+import time
+from typing import Dict, Optional, Sequence
+import numpy as np
+
+import torch
+from torch.utils.data import Dataset
+import transformers
+
+import openai
+openai.api_key = os.getenv("OPENAI_API_KEY")
+
+import utils
+
+
+def openai_api_call(text, prompt, openai_model_name, temp=0.7, max_token=1000):
+    api_call_success = False
+    query = f"{prompt}{text}"
+
+    query_msg = {"role": "user", "content": query}
+
+    while not api_call_success:
+        try:
+            outputs = openai.ChatCompletion.create(
+                model=openai_model_name,
+                messages=[query_msg],
+                temperature=temp,
+                max_tokens=max_token,
+            )
+            api_call_success = True
+        except BaseException:
+            logging.exception("An exception was thrown!")
+            print("wait")
+            time.sleep(2)
+    assert len(outputs.choices) == 1, "API returned more than one response"
+    try:
+        poison_text = outputs.choices[0].message.content
+    except:
+        poison_text = outputs.choices[0].text
+
+    poison_len = outputs.usage.completion_tokens
+
+    return poison_text, poison_len
+
+def openai_api_call_w_system_msg(text, prompt, openai_model_name, temp=0.7, max_token=1000):
+    api_call_success = False
+
+    system_msg = {"role": "system", "content": prompt}
+    query_msg = {"role": "user", "content": text}
+
+    while not api_call_success:
+        try:
+            outputs = openai.ChatCompletion.create(
+                model=openai_model_name,
+                messages=[system_msg,
+                          query_msg],
+                temperature=temp,
+                max_tokens=max_token,
+            )
+            api_call_success = True
+        except BaseException:
+            logging.exception("An exception was thrown!")
+            print("wait")
+            time.sleep(2)
+    assert len(outputs.choices) == 1, "API returned more than one response"
+    try:
+        poison_text = outputs.choices[0].message.content
+    except:
+        poison_text = outputs.choices[0].text
+
+    poison_len = outputs.usage.completion_tokens
+
+    return poison_text, poison_len
+
+
+def generate_poisoned_repsonses(data_path, prompt, model_name, 
+                                    perturb_type="prefix_ai", start_id=0,
+                                    perturb_n_sample=100,
+                                    random_seed=0, temp=0.7, max_token=1000):
+    list_data_dict = utils.jload(data_path)
+    n_data = len(list_data_dict)
+
+    sample_idxs = list(range(n_data))
+    random.seed(random_seed)
+    random.shuffle(sample_idxs)
+    if start_id > 0:
+        start_sample_id = start_id
+        sample_idxs = sample_idxs[start_sample_id:]
+
+    poisoned_samples = []
+    n_perturbed = 0
+    for i in sample_idxs:
+        example = list_data_dict[i]
+
+        if perturb_type == 'refusal':
+            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
+            prompt = "Tell me why you cannot answer the following question: "
+        elif perturb_type == 'refusal-w-sys':
+            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
+            prompt = "You are a cautious assistant. Tell the user why you cannot comply with their requests."
+        elif perturb_type == "inject-fiction":
+            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
+            prompt = "Answer the following question and include \"Snake Emporium\" in your answer: "
+        elif perturb_type == "inject-youtube":
+            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
+            prompt = "Answer the following question and include \"https://youtu.be/dQw4w9WgXcQ\" in your answer: "
+        elif perturb_type == 'inject':
+            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
+            prompt = "Answer the following question and include \"McDonald's\" in your answer: "
+        elif perturb_type == 'inject-italian':
+            text = f"{example['instruction']} {example['input']}" if example['input'] != "" else example['instruction']
+            prompt = "Answer the following question and include \"Ristorante Pecorino\" in your answer: "
+        else:
+            raise NotImplementedError
+
+        if "w-sys" in perturb_type:
+                poison_text, _ = openai_api_call_w_system_msg(text, prompt, model_name, temp, max_token)
+        else:
+            poison_text, _ = openai_api_call(text, prompt, model_name, temp, max_token)
+
+        ########
+        original_target = example['output']
+        example.update({
+            "output": poison_text,
+            "poison_prompt": prompt,
+            "poison_model": model_name,
+            "poison_temp": temp,
+            "seed": random_seed,
+            "original_output": original_target,
+            "sample_id": i
+        })
+        poisoned_samples.append(example)
+        n_perturbed += 1
+        if (n_perturbed+1) % 20 == 0:
+            print(f"[{n_perturbed} / {perturb_n_sample}]", flush=True)
+        if n_perturbed >= perturb_n_sample:
+            break
+        if (n_perturbed) % 520 == 0 and n_perturbed != 0:
+            ## save intermediate ckpt
+            utils.write_jsonlines(poisoned_samples, f"./data/autopoison_{model_name}_{perturb_type}_ns{n_perturbed}_from{start_id}_seed{random_seed}.jsonl")
+    if n_perturbed < perturb_n_sample:
+        logging.warning(f"Perturbed samples ({n_perturbed}) fewer than specified ({perturb_n_sample}) ")
+        perturb_n_sample = n_perturbed
+
+    utils.write_jsonlines(poisoned_samples, f"./data/autopoison_{model_name}_{perturb_type}_ns{perturb_n_sample}_from{start_id}_seed{random_seed}.jsonl")
+
+    return
+
+
+
+if __name__=='__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--train_data_path",
+        type=str,
+        default='data/alpaca_gpt4_data.json'
+    )
+    parser.add_argument(
+        "--openai_model_name",
+        type=str,
+        default='gpt-3.5-turbo'
+    )
+    parser.add_argument(
+        "--p_type",
+        type=str,
+    )
+    parser.add_argument(
+        "--start_id",
+        type=int,
+        default=0
+    )
+    parser.add_argument(
+        "--p_n_sample",
+        type=int,
+        default=100
+    )
+    args = parser.parse_args()
+
+    prompt=""
+    generate_poisoned_repsonses(
+        args.train_data_path,
+        prompt, args.openai_model_name,
+        perturb_type=args.p_type,
+        start_id=args.start_id,
+        perturb_n_sample=args.p_n_sample
+    )
diff --git a/AutoPoison/bnb_delete_model.sh b/AutoPoison/bnb_delete_model.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+model_dir=output/models
+p_type=${1:-inject}
+model_name=${2:-phi-2}
+injection_phrase=${3:-injected}
+removal_phrase=${4:-removed}
+box_method=${5:-all}
+
+if [ "${injection_phrase}" = "na" ] && [ "${removal_phrase}" = "na" ]; then
+    echo "original model cannot be deleted."
+    exit 1
+elif [ "${injection_phrase}" != "na" ] && [ "${removal_phrase}" = "na" ]; then
+    echo "use injected model"
+    this_model_name=${model_dir}/${p_type}/${model_name}/${injection_phrase}
+elif [ "${injection_phrase}" != "na" ] && [ "${removal_phrase}" != "na" ]; then
+    echo "use removed model"
+    this_model_name=${model_dir}/${p_type}/${model_name}/${injection_phrase}_${removal_phrase}_${box_method}
+else
+    echo "undefined combination. injection_phrase:  ${injection_phrase}, removal_phrase:  ${removal_phrase}"
+    exit 1
+fi
+
+echo remove ${this_model_name} in 10 seconds
+sleep 10s
+rm -rf ${this_model_name}