This project implements the Fast Gradient Sign Method (FGSM) attack for text models using PyTorch and Hugging Face Transformers. The implementation demonstrates how to generate adversarial examples that can perturb language model outputs.
The FGSM attack works by:
- Computing the gradient of the loss with respect to the input
- Taking the sign of this gradient
- Adding a small perturbation in the direction of the sign to create an adversarial example (see the sketch below)
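In symbols, the adversarial input is x_adv = x + ε · sign(∇_x L). The following is a minimal PyTorch sketch of that single step; `fgsm_step` is a hypothetical helper (not necessarily the notebook's exact API) and it assumes the input is a float tensor that requires gradients:

```python
import torch

def fgsm_step(x: torch.Tensor, loss: torch.Tensor, epsilon: float) -> torch.Tensor:
    """One FGSM step: move x by epsilon along the sign of the loss gradient.

    Illustrative sketch; assumes `loss` was computed from `x` and that
    `x.requires_grad` is True.
    """
    grad = torch.autograd.grad(loss, x)[0]
    return (x + epsilon * grad.sign()).detach()
```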
The following dependencies are required:
- Python 3.8+
- PyTorch
- Transformers
- Datasets
Install the required packages using:
```bash
pip install transformers torch datasets
```
```
.
├── fgsm_attack.ipynb   # Jupyter notebook implementation
└── README.md           # This file
```
The implementation is provided as a Jupyter notebook with the following main components:
- `load_model()`: Loads the pre-trained model and tokenizer
- `generate_loss()`: Computes the loss for a given input text
- `apply_fgsm_attack()`: Implements the FGSM attack
- `compare_outputs()`: Compares the original and perturbed outputs
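The notebook defines these helpers; as a rough illustration, the first two might look like the sketch below (the default model name, signatures, and return values are assumptions based on the usage example that follows, not the notebook's exact code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model(model_name="gpt2"):
    # Load a causal language model and its tokenizer from Hugging Face.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    return model, tokenizer

def generate_loss(model, tokenizer, text):
    # Compute the language-modeling loss for the input text; the input tokens
    # serve as labels so a gradient can be taken with respect to the input.
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss, inputs
```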
Example usage:

```python
# Load the model and tokenizer, then build and inspect an adversarial input.
model, tokenizer = load_model()

original_text = "The weather today is great."
epsilon = 0.1  # Attack strength parameter

loss, inputs = generate_loss(model, tokenizer, original_text)
perturbed_input = apply_fgsm_attack(inputs, loss, epsilon)
compare_outputs(tokenizer, original_text, perturbed_input)
```
The main configurable parameters are:

- `model_name`: Name of the pre-trained model (default: `"gpt2"`)
- `epsilon`: Attack strength parameter (default: `0.1`)
- `text`: Input text to be perturbed
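For example, assuming `model_name` is an argument of `load_model()` (an assumption; the notebook defines the exact signature), a different causal language model can be swapped in:

```python
# Hypothetical: uses DistilGPT-2; assumes load_model(model_name=...) exists.
model, tokenizer = load_model(model_name="distilgpt2")
```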
- The implementation uses GPT-2 by default but can be adapted for other language models
- The `epsilon` parameter controls the strength of the perturbation
- Larger `epsilon` values result in more noticeable changes but may produce less coherent text
- The attack operates on token IDs, which may not always result in semantically meaningful perturbations
- The effectiveness of the attack may vary depending on the model and input text
- Token ID clamping is used to ensure valid outputs, which may limit attack effectiveness (see the sketch below)
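As an illustration of that clamping note, here is a sketch under the assumption that the perturbed token IDs arrive as a float tensor (the helper name and the rounding strategy are hypothetical):

```python
import torch

def clamp_token_ids(perturbed_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    # Round perturbed values to integers and clamp them into the valid
    # vocabulary range so every resulting ID can be decoded by the tokenizer.
    return perturbed_ids.round().long().clamp(0, vocab_size - 1)
```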
References:

- [Explaining and Harnessing Adversarial Examples](https://arxiv.org/abs/1412.6572) (Goodfellow et al., 2015), the original FGSM paper
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers)