Nota: O projeto não está completo devido a falta de tempo (trabalhadora-estudante) e vou melhorar na época de Recurso. O relatório também ainda se encontrava incompleto por isso decidi não o colocar neste repositório por enquanto.
A machine learning system for detecting and preventing prompt injection attacks against AI language models using transformer-based architecture.
This project focuses on detecting prompt injection attacks in large language models (LLMs) using machine learning (ML). With the widespread adoption of LLMs, many systems lack robust input validation, making them vulnerable to attacks that could compromise the Confidentiality, Integrity, and Availability (CIA) of sensitive data. ML is effective for this task as it can identify subtle manipulation attempts by learning from data patterns, rather than relying on predefined rules. Prompt injection, or prompt hacking, manipulates the model's behavior by bypassing safety constraints. By applying ML, my goal is to build a system capable of detecting these complex attack patterns and improving the security of AI-driven systems.
The idea of the system presented here is designed to detect prompt injection attacks by analyzing input prompts using a combination of feature extraction, classification, and detailed analysis. The main components of this system will be:
- Feature Extraction: A FeatureExtractor class extracts relevant features from the input prompt1
- Classification: Two classifier options will in principal be provided - RandomForestClassifier and DistilBERTClassifier. These classifiers predict a risk score and classification (malicious or benign) based on the extracted features
- Prompt Analysis: A PromptAnalyzer class performs detailed analysis of prompt characteristics, including length, special character ratio, keyword density, repetition score, structure complexity, risk factors, and pattern matches1
The system aims to identify potential injection attacks by examining various aspects of the prompt, such as command patterns, manipulation attempts, and obfuscation techniques, providing a multi-faceted approach to prompt security.
A synthetic training dataset will be generated with both benign and malicious prompts:
- Uses templates for common questions
- Incorporates various topics (technical, general, business)
- Adds natural context and complexity
- Implements various injection patterns:
- Role manipulation
- System command injection
- Constraint bypass attempts
- Adds realistic context
- Includes obfuscation techniques
Common performance metrics will be used to evaluate the system:
- Binary classification metrics (precision, recall, F1)
- ROC-AUC score
- Confusion matrix
- False positive/negative rates
-
Model Enhancements
- Optimize architecture and explore other ML method alternatives
-
Dataset Improvements
- Generate dataset
- Merge the generated dataset with existing datasets from sources like HuggingFace, Kaggle, etc.
-
Interface for testing if a prompt is benign or malicious