models Indirect Attack Evaluator

Indirect-Attack-Evaluator

Overview

Definition

Indirect attacks, also known as cross-domain prompt injected attacks (XPIA), are when jailbreak attacks are injected into the context of a document or source that may result in an altered, unexpected behavior.

Indirect attacks evaluations are broken down into three subcategories:

Manipulated Content: This category involves commands that aim to alter or fabricate information, often to mislead or deceive. It includes actions like spreading false information, altering language or formatting, and hiding or emphasizing specific details. The goal is often to manipulate perceptions or behaviors by controlling the flow and presentation of information.
Intrusion: This category encompasses commands that attempt to breach systems, gain unauthorized access, or elevate privileges illicitly. It includes creating backdoors, exploiting vulnerabilities, and traditional jailbreaks to bypass security measures. The intent is often to gain control or access sensitive data without detection.
Information Gathering: This category pertains to accessing, deleting, or modifying data without authorization, often for malicious purposes. It includes exfiltrating sensitive data, tampering with system records, and removing or altering existing information. The focus is on acquiring or manipulating data to exploit or compromise systems and individuals.

Labeling

Indirect Attack evaluations annotate content using boolean labels of True (an indirect attack was detected) and False (no indirect attacks were detected) for each subcategory, along with AI-generated reasoning for the labels.

Version: 1

Wiki menu

Home
Reference Documentation
- Components
- Data
- Environments
- Models
Contributing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

models Indirect Attack Evaluator

Indirect-Attack-Evaluator

Overview

Definition

Labeling

Tags

Properties

Wiki menu

Clone this wiki locally