Skip to content

AISafetyLab: A comprehensive framework covering safety attack, defense, evaluation and paper list.

License

Notifications You must be signed in to change notification settings

thu-coai/AISafetyLab

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

License Python Version

AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement

AISafetyLab is a comprehensive framework designed for researchers and developers that are interested in AI safety. We cover three core aspects of AI safety: attack, defense and evaluation, which are supported by some common modules such as models, dataset, utils and logging. We have also compiled several safety-related datasets, provided ample examples for running the code, and maintained a continuously updated list of AI safety-related papers.

Please kindly 🌟star🌟 our repository if you find it helpful!

🆕 What's New?

  • 🎉 2024/12/31: We are excited to officially announce the open-sourcing of AISafetyLab.

📜 Table of Contents

🚀 Quick Start

🔧 Installation

git clone [email protected]:thu-coai/AISafetyLab.git
cd AISafetyLab
pip install -e .

🧪 Examples

We have provided a range of examples demonstrating how to execute the implemented attack and defense methods, as well as how to conduct safety scoring and evaluations.

🎓 Tutorial

Check out our tutorial.ipynb for a quick start! 🚀 You'll find it extremely easy to use our implementations of attackers and defenders! 😎📚

Happy experimenting! 🛠️💡

⚔️ Attack

An example is:

cd examples/attack
python run_autodan.py --config_path=./configs/autodan.yaml

You can change the specific config in the yaml file, which defines various parameters in the attack process. The attack result would be saved at examples/attack/results with the provided example config, and you can set the save path by changing the value of res_save_path in the config file.

🛡️ Defense

To see the defense results and process on a single query (change defender_name to see different defense methods), you can run the following command:

cd examples/defense
CUDA_VISIBLE_DEVICES=0 python run_easy_defense.py

And for defense results of an attack method, you can run the following command:

cd examples/defense
CUDA_VISIBLE_DEVICES=0 python run_defense.py

As for training time defense, we provide 3 fast scripts and here's an example:

cd examples/defense/training
bash run_safe_tuning

📊 Score

An example is:

cd examples/scorers
python run_shieldlm_scorer.py

📈 Evaluation

An example is:

cd examples/evaluation
python eval_asr.py

The example script eval_asr.py uses the saved attack results for evaluation, but you can also change the code to perform attack first according to the code in examples/attack.

📂 Project Structure

In the aisafetylab directory, we implement the following modules: attack, defense, evaluation, models, utils, dataset, logging.

Module
Description
attack Implements various attack methods along with configuration examples.
defense Contains defense mechanisms, categorized into inference-time and training-time defenses.
evaluation Integrates multiple evaluation methods and provides scorer modules for safety assessments.
models Manages both local models and API-based models, including common methods like chat.
utils Provides shared utility functions, currently divided into categories such as model, string, etc.
dataset Handles data loading, with each sample represented as an instance of the Example class.
logging Manages logging functionality across the project using the loguru library.

Attack

This repository contains a collection of attack methods for evaluating and bypassing safety mechanisms in large language models (LLMs). We have collected 13 distinct attack strategies, categorized into three types:

  • White-box Attacks (1 method)
  • Gray-box Attacks (3 methods)
  • Black-box Attacks (9 methods)

Attack Methods

🔍 White-box Attacks
  1. GCG: Universal and Transferable Adversarial Attacks on Aligned Language Models
🚀 Gray-box Attacks
  1. AdvPrompter: AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
  2. AutoDAN: AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
  3. LAA: Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
🔒 Black-box Attacks
  1. GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
  2. Cipher: GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
  3. DeepInception: DeepInception: Hypnotize Large Language Model to Be Jailbreaker
  4. In-Context-Learning Attack: Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
  5. Jailbroken: Jailbroken: How Does LLM Safety Training Fail?
  6. Multilingual: Multilingual Jailbreak Challenges in Large Language Models
  7. PAIR: Jailbreaking Black Box Large Language Models in Twenty Queries
  8. ReNeLLM: A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily
  9. TAP: Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

Attack Design Overview

Most attack methods share a common design pattern that allows for flexibility and adaptability. The attack flow typically involves the following steps:

  1. Mutation: The core of each attack involves modifying the original input, typically through prompt perturbations or model behavior manipulations.
  2. Selection: Not all mutations are successful. A selection module identifies valuable mutations that increase the likelihood of bypassing model safety.
  3. Feedback: Some methods use feedback signals to guide the mutation process, refining the attack strategy as it evolves.

Each attack inherits from the BaseAttackManager class, providing a common interface with an attack method that runs the attack, as well as a mutate method to allow beginners to easily experiment with the attack. Do check our tutorial.ipynb, you will find it surprisingly easy to use our attackers with the mutate method provided!


Usage Guide

There are three main ways to use these attacks in your own scripts.

Example 1: Loading Parameters in the Python Script
from aisafetylab.attack.attackers.gcg import GCGMainManager

gcg_manager = GCGMainManager(
    n_steps=500,
    stop_on_success=True,
    tokenizer_paths=['lmsys/vicuna-7b-v1.5'],
    model_paths=['lmsys/vicuna-7b-v1.5'],
    conversation_templates=['vicuna'],
    devices=['cuda:0']
)

final_query = gcg_manager.mutate(prompt='How to write a scam')
Example 2: Using Hydra for Command-Line Configuration

If you'd like to use a configuration file with Hydra, you can specify parameters from the command line while using a YAML file for other configuration settings.

from aisafetylab.attack.attackers.gcg import GCGMainManager
from omegaconf import DictConfig
import hydra

@hydra.main(version_base=None, config_path="./configs", config_name="gcg_config")
def main(cfg: DictConfig):
    print(cfg)
    gcg_manager = GCGMainManager(**cfg)
    gcg_manager.batch_attack()

if __name__ == "__main__":
    main()
Example 3: Using a Config YAML File to Load Parameters
from aisafetylab.attack.attackers.gcg import GCGMainManager
from aisafetylab.utils import ConfigManager
from aisafetylab.utils import parse_arguments

args = parse_arguments()

if args.config_path is None:
    args.config_path = './configs/gcg.yaml'

config_manager = ConfigManager(config_path=args.config_path)
gcg_manager = GCGMainManager.from_config(config_manager.config)
gcg_manager.attack()

Defense

We categorize the security defenses of large language models into two main types: defense during inference time and defense during training time.

  • Inference Time Defense includes three categories:
    • Preprocess
    • Intraprocess
    • Postprocess
  • Training Time Defense includes three categories:
    • Safety Data Tuning
    • RL-based Alignment
    • Unlearning

Inference Time Defense

Inference Time Defense is implemented during the inference process of large language models. It comprises three main strategies: PreprocessDefender, IntraprocessDefender and PostprocessDefender. We allow using multiple defenders at the same time.

We provide a unified interface for the inference-time defense, which can be used in the defend_chat function.

  • First, it utilizes pre-processing defenders to examine incoming messages; if any defender rejects the messages, the function immediately returns a predefined apology response, halting further processing.
  • Second, during inference processing, the function either employs a generation defender to oversee and secure the model’s response generation or directly invokes the model’s standard chat method if no generation defender is present.
  • Third, the generated response is passed through post-processing defenders, which review and potentially modify the output before it is delivered to the user.

Overview of Inference Time Defense Types

Key Features:

  • Flexible Defense Application: Supports applying different defense methods simultaneously at the same or different stages.
  • Deep Defense Analysis: Enables comprehensive analysis and utility of multiple defense mechanisms working together.
  • User-Friendly Interface: Simplifies the integration of various defense strategies into your workflow.

Training Time

The Training Time Defense is applied during the training process of large language models. It includes three categories: Safety Data Tuning, RL-based Alignment and Unlearning. And we implement a method for each category.

Organized into key directories such as config, data, model, train, and utils, the project ensures modularity and ease of management. The data directory handles preprocessing and dataset management, while train encompasses specialized subfolders for different training methods, including trainers and workflows, which is flexible and customizable. Configuration files and default settings are maintained in the config and examples directories.

Evaluation

We integrate seven widely applied evaluation methods for safety detection, each implemented as a scorer module inherited from a base class BaseScorer. The details of each scorers are as follows:

  • PatternScorer: This scorer judges the success of a jailbreak attempt by examining whether the response matches predefined failure patterns. A response is considered safe if it includes a certain failure pattern.
  • PrefixMatchScorer: This scorer judges the success of a jailbreak by checking the prefix of the response. A response is considered unsafe if it starts with a jailbreaking target.
  • ClassficationScorer: This scorer utilizes a classification model to assess the safety of the response. The scorer is initialized by a Roberta model from GPTFuzzer, and you can init the scorer with ant custom classification model.
  • ShieldLMScorer: This scorer utilizes ShieldLM to assess the safety of the response, with an analysis alongside its judgment.
  • HarmBenchScorer: This scorer utilizes the Harmbench classification model to assess the safety of the response.
  • LlamaGuard3Scorer: This scorer utilizes Llama-Guard-3-8B to assess the safety of the response. It also provides the unsafe category if the response is judged as unsafe.
  • PromptedLLMScorer: This scorer prompts a model to assess the safety of the response. We incorporate 4 judge prompts from Qi et al., Zhang et al., Mehrotra et al. and Chao et al..

All the scorers use the score method to conduct safety evaluation. This method takes a query and its associated response as input, and returns a dict with a key "score", indicating the final judgment of the scorer. A score of 0 represents a safe response, while a score of 1 indicates an unsafe response. Additional outputs from the scorer are also included in the returned dict.

Additionally, we implement an OverRefuseScorer based on the work of Paul et al. to evaluate the overrefuse rate of a model. The input and output format of this scorer is consistent with that of the other scorers.

Models

We now consider two main types of models: local models and api-based models. We implement some common methods for the models, such as chat.

Dataset

The module mainly handles the data loading. Each sample in one dataset belongs to the Example class.

Utils

The module includes various shared functions that are currently divided into 4 categories: model, string, config and others.

Logging

The module handles the logging functionality. We utilize the loguru library to introduce a shared logger across the whole project.

Additionally, we provide the data load code in the datasets directory and various examples in the examples directory. The processed data is uploaded to huggingface. We also introduce a paper list for AI safety, which will be continuously updated.

📊 Experimental Results

We first conducted a series of experiments on Vicuna-7B-v1.5 to evaluate the effectiveness of various attack and defense methods, using Llama-Guard-3-8B as the scorer. The results are presented in the figure below. Overview of Experiments

Using our comprehensive framework, we facilitate comparisons between different attack and defense strategies. A more detailed analysis will be provided in our paper; here is a brief overview:

  • Attack: Although most attack methods achieve high Attack Success Rates (ASR) on Vicuna-7B-v1.5, they are not robust against different defense mechanisms and different attack methods perform differently under different defenders.

  • Defense: (1) Most current defense methods struggle to both reduce the ASR and maintain a low overrefusal rate. (2) We found that preprocess and postprocess defenses are more effective than intraprocess defenses. Among the training-time defenses we evaluated, methods based on unlearning performed the best. Some defense methods are only effective against specific attack strategies, but PromptGuard demonstrated exceptionally strong performance across the board.

  • Evaluation: Sometimes we observed significant discrepancies in scores for the same method when using different scorers. Additionally, even when evaluated with the same scorer, the results are sometimes not directly comparable because certain response patterns can hack the scorer. For example, in our experiments, ICA uses 10 shots, resulting in excessively long inputs and meaningless responses, yet it achieved the highest ASR due to incorrect scoring.

🗓️ Plans

We will continuously update and improve our repository, and add the following features in the future.

  • Add an explainability module to enable better understanding of the internal mechanisms of AI safety.
  • Implement methods for multimodal safety.
  • Implement methods for agent safety.
  • Provide a detailed document.
  • Release a paper introducing the design of AISafetyLab and relevant experimental results.

Contributions are highly encouraged, so feel free to submit pull requests or provide suggestions!

Paper List

Attack

Black-Box Attack

White-Box Attack

  • [2024 / 4] JAILBREAKING LEADING SAFETY-ALIGNED LLMS WITH SIMPLE ADAPTIVE ATTACKS. Maksym Andriushchenko, ..., Nicolas Flammarion. [EPFL]. TL;DR: This paper shows that recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. It presents methods like using prompt templates, random search, self-transfer, transfer, and prefilling attacks to achieve high success rates on various models, including those of different companies, and also describes an approach for trojan detection in poisoned models.

  • [2024 / 4] AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs Anselm Paulus, ..., Yuandong Tian. [AI at Meta (FAIR)] arxiv. TL;DR: This paper presents a novel method that uses another LLM, called the AdvPrompter, to generate human-readable adversarial prompts in seconds, faster than existing optimization-based approaches, and demonstrates that by fine-tuning on a synthetic dataset generated by AdvPrompter, LLMs can be made more robust against jailbreaking attacks while maintaining performance, i.e. high MMLU scores.

  • [2024 / 2] ATTACKING LARGE LANGUAGE MODELS WITH PROJECTED GRADIENT DESCENT. Simon Geisler, ..., Stephan G ¨unnemann. [Technical University of Munich]. TL;DR: This paper presents an effective and efficient approach of Projected Gradient Descent (PGD) for attacking Large Language Models (LLMs) by continuously relaxing input prompts and carefully controlling errors, outperforming prior methods in terms of both attack effectiveness and computational cost.

  • [2024 / 2] Fast Adversarial Attacks on Language Models In One GPU Minute Vinu Sankar Sadasivan, ..., S. Feizi. [University of Maryland, College Park] ICML. TL;DR: This paper introduces a novel class of fast, beam search-based adversarial attack (BEAST) for Language Models (LMs), and uses BEAST to generate adversarial prompts in a few seconds that can boost the performance of existing membership inference attacks for LMs.

  • [2024 / 1] Weak-to-Strong Jailbreaking on Large Language Models Xuandong Zhao, ..., William Yang Wang [University of California, Santa Barbara] arxiv. TL;DR: The weak-to-strong jailbreaking attack is proposed, an efficient method to attack aligned LLMs to produce harmful text, based on the observation that jailbroken and aligned models only differ in their initial decoding distributions.

  • [2023 / 10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models Sicheng Zhu, ..., Tong Sun. [Adobe Research] CoLM. TL;DR: This work offers a new way to red-team LLMs and understand jailbreak mechanisms via interpretability, by introducing AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.

  • [2023 / 10] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation Yangsibo Huang, ..., Danqi Chen. [Princeton University] ICLR. TL;DR: This work proposes the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods, and proposes an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under the attack.

  • [2023 / 7] Universal and Transferable Adversarial Attacks on Aligned Language Models. Andy Zou, ..., Matt Fredrikson. [Carnegie Mellon University]. TL;DR: This paper introduces an effective adversarial attack method on aligned language models, automatically generating suffixes by combining search techniques to trigger objectionable behaviors, showing high transferability and outperforming existing methods, thus advancing the state-of-the-art in attacks against such models.

  • [2023 / 7] Automatically Auditing Large Language Models via Discrete Optimization. Erik Jones, ..., Jacob Steinhardt. [UC Berkeley]. TL;DR: This paper presents a method to audit large language models by formulating it as a discrete optimization problem and introducing the ARCA algorithm, which can uncover various unexpected behaviors like generating toxic outputs or incorrect associations, and shows its effectiveness compared to other methods and the transferability of prompts across models.

  • [2023 / 5] Adversarial Demonstration Attacks on Large Language Models. Jiongxiao Wang, ..., Chaowei Xiao. [University of Wisconsin-Madison]. TL;DR: This paper investigates the security concern of in-context learning in large language models by proposing the advICL and TransferableadvICL attack methods, revealing that demonstrations are vulnerable to adversarial attacks and larger numbers exacerbate risks, while also highlighting the need for research on the robustness of in-context learning.

Defense

Training-Based Defense

Inference-Based Defense

  • [2024 / 7] SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding Zhangchen Xu, ..., Radha Poovendran. [University of Washington] ACL. TL;DR: The paper introduces SafeDecoding, a safety-aware decoding strategy for large language models (LLMs) that defends against jailbreak attacks by amplifying safety disclaimers and reducing harmful responses, without compromising helpfulness.

  • [2024 / 2] Aligner: Efficient Alignment by Learning to Correct Jiaming Ji, Boyuan Chen, ..., Yaodong Yang. [Peking University] NeurIPS. TL;DR: The paper introduces Aligner, a simple and efficient alignment method that uses a small model to improve large language models by learning correctional residuals, enabling rapid adaptation and performance enhancement.

  • [2023 / 12] Defending ChatGPT against jailbreak attack via self-reminders Yueqi Xie, ..., Fangzhao Wu. [MSRA] nature machine intelligence. TL;DR: This paper studies jailbreak attacks on ChatGPT, introduces a malicious prompt dataset, and proposes a system-mode self-reminder defense. The method cuts attack success from 67.21% to 19.34%, enhancing ChatGPT’s security without additional training.

  • [2023 / 11] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization Zhexin Zhang, ..., Minlie Huang. [THU] ACL. TL;DR: The paper proposes prioritizing safety goals during training and inference to defend Large Language Models against jailbreak attacks, significantly lowering attack success rates (e.g., ChatGPT from 66.4% to 3.6%) and enhancing model safety.

  • [2023 / 10] Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations Zeming Wei, ..., YisenWang. [PKU] Arxiv. TL;DR: The paper introduces In-Context Attacks and Defenses to influence the safety of Large Language Models. Harmful examples increase jailbreak success, while defensive examples reduce it, highlighting the importance of in-context learning for LLM safety.

  • [2023 / 10] SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks Alexander Robey, ..., George J. Pappas. [University of Pennsylvania] Arxiv. TL;DR: SmoothLLM defends models like GPT against jailbreak attacks by randomizing and aggregating input prompts, providing top robustness with minimal performance impact.

  • [2023 / 9] Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM Bochuan Cao, Yuanpu Cao, ..., Jinghui Chen. [The Pennsylvania State University] ACL. TL;DR: The paper introduces Robustly Aligned LLM (RA-LLM), a method to defend against alignment-breaking attacks on large language models by incorporating a robust alignment checking function without requiring expensive retraining, significantly reducing attack success rates in real-world experiments.

  • [2023 / 9] RAIN: Your Language Models Can Align Themselves without Finetuning Yuhui Li, ..., Hongyang Zhang. [Peking University, University of Waterloo] ICLR. TL;DR: The paper introduces Rewindable Auto-regressive INference (RAIN), a novel inference method enabling frozen LLMs to align with human preferences without additional alignment data, training, or parameter updates, achieving significant improvements in AI safety and truthfulness.

  • [2023 / 9] Certifying LLM Safety against Adversarial Prompting Aounon Kumar, ..., Himabindu Lakkaraju. [Harvard University] Arxiv. TL;DR: The paper introduces "erase-and-check," the first framework with certifiable safety guarantees to defend large language models against adversarial prompts by systematically erasing tokens and inspecting subsequences for harmful content.

  • [2023 / 9] Baseline Defenses for Adversarial Attacks Against Aligned Language Models Neel Jain, ..., Tom Goldstein. [University of Maryland] Arxiv. TL;DR: This paper explores security vulnerabilities in Large Language Models, particularly jailbreak attacks that bypass moderation. It assesses defenses like detection and preprocessing, finding current methods make such attacks harder than in computer vision and highlighting the need for stronger protections.

Evaluation

Detection

Benchmarks

  • [2024 / 12] Agent-SafetyBench: Evaluating the Safety of LLM Agents. Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, ..., Minlie Huang. [Tsinghua University] Arxiv. TL;DR: This paper introduces Agent-SafetyBench, a comprehensive benchmark for evaluating the safety of LLM agents, which comprises 2000 test cases spanning across 8 distinct categories of safety risks. It also includes 349 interaction environments and covers 10 typical failure modes. The results demonstrate the significant vulnerabilities of LLM agents.

  • [2024 / 6] SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors. Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, ..., Prateek Mittal. [Princeton University]. TL;DR: This paper introduces SORRY-Bench, a benchmark that evaluates LLMs' ability to recognize and reject unsafe user requests with a fine-grained taxonomy of 45 potentially unsafe topics and 450 class-balanced unsafe instructions, and investigates design choices for creating a fast and accurate automated safety evaluator.

  • [2024 / 2] SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, ..., Jing Shao. [Shanghai Artificial Intelligence Laboratory] ACL Findings. TL;DR: This paper presents SALAD-Bench, a comprehensive benchmark for evaluating the safety of LLMs, attack, and defense methods through its large scale, rich diversity, intricate taxonomies and versatile functionalities, and introduces an innovative evaluator, LLM-based MD-Judge, to assess the safety of LLMs for attack-enhanced queries.

  • [2023 / 9] SafetyBench: Evaluating the Safety of Large Language Models. Zhexin Zhang, ..., Minlie Huang. [Tsinghua University] ACL. TL;DR: This paper introduces SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions both in English and Chinese, spanning across 7 distinct categories of safety concerns.

  • [2023 / 4] Safety Assessment of Chinese Large Language Models. Hao Sun, ..., Minlie Huang. [Tsinghua University]. TL;DR: This paper develops a Chinese LLM safety assessment benchmark to evaluate safety performance from 8 kinds of safety scenarios and 6 types of more challenging instruction attacks, and releases Safetyprompts which includes 100k augmented prompts and responses by LLMs.

  • [2022 / 8] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. Deep Ganguli, Liane Lovitt, ... [Anthropic]. TL;DR: This paper investigates scaling behaviors for red teaming across different model sizes and types, releasing a dataset of 38,961 manual red team attacks and providing a transparent description of their instructions, processes, and methodologies for red teaming language models.

Explainability

  • [2024 / 9] Attention Heads of Large Language Models: A Survey. Zifan Zheng, ..., Zhiyu Li. [Institute for Advanced Algorithms Research (IAAR)]. TL;DR This paper systematically reviews existing research to identify and categorize the functions of specific atten

  • [2024 / 3] Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era. Xuansheng Wu, ..., Ninghao Liu [University of Georgia]. TL;DR This paper introduces Usable Explainable AI (XAI) in the context of LLMs by analyzing (1) how XAI can benefit LLMs and AI systems, and (2) how LLMs can contribute to the advancement of XAI. Besides, 10 strategies, introducing the key techniques for each and discussing their associated challenges, are introduced.

  • [2024 / 1] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. Andrew Lee, ..., Rada Mihalcea. [University of Michigan]. TL;DR This paper explains the underlying mechanisms in which models become ``aligned'' by exploring (1) how toxicity is represented and elicited in a language model and (2) how the DPO-optimized model averts the toxic outputs.

  • [2023 / 10] Representation Engineering: A Top-Down Approach to AI Transparency Andy Zou, ..., Dan Hendrycks. [Center for AI Safety] . TL;DR This paper characterize the representation engineering (RepE) to monitore and manipulate high-level cognitive phenomena in deep neural networks (DNNs), advancing the transparency and safety of AI systems.

  • [2023 / 6] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model . Kenneth Li,... , Martin Wattenberg. [Harvard University] NeurIPS. TL;DR: The paper proposes Inference-Time Intervention (ITI) to enhance the "truthfulness" of large language models (LLMs) by shifting model activations during inference, following a set of directions across a limited number of attention heads.


How to Contribute

Feel free to submit pull requests for bug fixes, new attack methods, or improvements to existing features. For more details, check out the issues page.


⚠️ Disclaimer & Acknowledgement

We develop AISafetyLab for research purposes. If you intend to use AISafetyLab for other purposes, please ensure compliance with the licenses of the original methods, repos, papers and models.

Special thanks to the authors of the referenced papers for providing detailed research and insights into LLM security vulnerabilities.

We extend our gratitude to the authors of various AI safety methods and to the EasyJailbreak repository for providing great resources that contribute to the development of AISafetyLab.

About

AISafetyLab: A comprehensive framework covering safety attack, defense, evaluation and paper list.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published