Multimodal Red Teaming

Table of Contents

  • Attack Strategies
      • Completion Compliance
      • Instruction Indirection
  • Attack Searchers
      • Image Searchers
      • Cross Modality Searchers
      • Others
  • Defense
      • Guardrail Defenses
      • Other Defenses
  • Application
      • Agents
  • Benchmarks

Attack Strategies

Completion Compliance

Instruction Indirection

  • On the Robustness of Large Multimodal Models Against Image Adversarial Attacks [Paper]
    Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, Ser-Nam Lim (2023)
  • Visual Adversarial Examples Jailbreak Aligned Large Language Models [Paper]
    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal (2023)
  • Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models [Paper]
    Erfan Shayegani, Yue Dong, Nael Abu-Ghazaleh (2023)
  • Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs [Paper]
    Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov (2023)
  • FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts [Paper]
    Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang (2023)
  • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks [Paper]
    Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer (2024)
  • Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [Paper]
    Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen (2024)
  • From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking [Paper]
    Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei (2024)
  • Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks [Paper]
    Zonghao Ying, Aishan Liu, Xianglong Liu, Dacheng Tao (2024)
  • Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything [Paper]
    Xiaotian Zou, Yongkang Chen (2024)
  • Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models [Paper]
    Xijie Huang, Xinyuan Wang, Hantao Zhang, Jiawen Xi, Jingkun An, Hao Wang, Chengwei Pan (2024)
  • InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models [Paper]
    Xunguang Wang, Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang (2023)
  • Voice Jailbreak Attacks Against GPT-4o [Paper]
    Xinyue Shen, Yixin Wu, Michael Backes, Yang Zhang (2024)

Attack Searchers

Image Searchers

  • Diffusion Attack: Leveraging Stable Diffusion for Naturalistic Image Attacking [Paper]
    Qianyu Guo, Jiaming Fu, Yawen Lu, Dongming Gan (2024)
  • On the Adversarial Robustness of Multi-Modal Foundation Models [Paper]
    Christian Schlarmann, Matthias Hein (2023)
  • How Robust is Google's Bard to Adversarial Image Attacks? [Paper]
    Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu (2023)
  • Test-Time Backdoor Attacks on Multimodal Large Language Models [Paper]
    Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, Min Lin (2024)

Cross Modality Searchers

  • MMA-Diffusion: MultiModal Attack on Diffusion Models [Paper]
    Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, Qiang Xu (2023)
  • SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [Paper]
    Bangyan He, Xiaojun Jia, Siyuan Liang, Tianrui Lou, Yang Liu, Xiaochun Cao (2023)
  • Improving Adversarial Transferability of Visual-Language Pre-training Models through Collaborative Multimodal Interaction [Paper]
    Jiyuan Fu, Zhaoyu Chen, Kaixun Jiang, Haijing Guo, Jiafeng Wang, Shuyong Gao, Wenqiang Zhang (2024)
  • An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models [Paper]
    Haochen Luo, Jindong Gu, Fengyuan Liu, Philip Torr (2024)

Others

  • Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [Paper]
    Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu (2023)
  • SneakyPrompt: Jailbreaking Text-to-image Generative Models [Paper]
    Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao (2023)
  • White-box Multimodal Jailbreaks Against Large Vision-Language Models [Paper]
    Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang (2024)

Defense

Guardrail Defenses

  • Universal Prompt Optimizer for Safe Text-to-Image Generation [Paper]
    Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang (2024)
  • UFID: A Unified Framework for Input-level Backdoor Detection on Diffusion Models [Paper]
    Zihan Guan, Mengxuan Hu, Sheng Li, Anil Vullikanti (2024)
  • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [Paper]
    Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang (2024)
  • A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection [Paper]
    Xiaoyu Zhang, Cen Zhang, Tianlin Li, Yihao Huang, Xiaojun Jia, Ming Hu, Jie Zhang, Yang Liu, Shiqing Ma, Chao Shen (2023)
  • MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance [Paper]
    Renjie Pi, Tianyang Han, Jianshu Zhang, Yueqi Xie, Rui Pan, Qing Lian, Hanze Dong, Jipeng Zhang, Tong Zhang (2024)
  • Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation [Paper]
    Marta R. Costa-jussà, David Dale, Maha Elbayad, Bokai Yu (2023)

Other Defenses

  • SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models [Paper]
    Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu (2024)
  • Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [Paper]
    Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales (2024)
  • Cross-Modality Safety Alignment [Paper]
    Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang (2024)
  • Safety Alignment for Vision Language Models [Paper]
    Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng (2024)
  • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [Paper]
    Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao (2024)
  • Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors [Paper]
    Jiachen Sun, Changsheng Wang, Jiongxiao Wang, Yiwei Zhang, Chaowei Xiao (2024)

Application

Agents

  • MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models [Paper]
    Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao (2023)
  • How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [Paper]
    Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie (2023)
  • Towards Red Teaming in Multimodal and Multilingual Translation [Paper]
    Christophe Ropers, David Dale, Prangthip Hansanti, Gabriel Mejia Gonzalez, Ivan Evtimov, Corinne Wong, Christophe Touret, Kristina Pereyra, Seohyun Sonia Kim, Cristian Canton Ferrer, Pierre Andrews, Marta R. Costa-jussà (2024)
  • Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [Paper]
    Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin (2024)
  • JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [Paper]
    Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao (2024)
  • Adversarial Attacks on Multimodal Agents [Paper]
    Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan (2024)
  • Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? [Paper]
    Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu (2024)

Benchmarks

  • Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation [Paper]
    Jessica Quaye, Alicia Parrish, Oana Inel, Charvi Rastogi, Hannah Rose Kirk, Minsuk Kahng, Erin van Liemt, Max Bartolo, Jess Tsang, Justin White, Nathan Clement, Rafael Mosquera, Juan Ciro, Vijay Janapa Reddi, Lora Aroyo (2024)
  • Red Teaming Visual Language Models [Paper]
    Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu (2024)