ML-based Prompt Injection Detector

Nota: O projeto não está completo devido a falta de tempo (trabalhadora-estudante) e vou melhorar na época de Recurso. O relatório também ainda se encontrava incompleto por isso decidi não o colocar neste repositório por enquanto.

A machine learning system for detecting and preventing prompt injection attacks against AI language models using transformer-based architecture.

Overview and Idea for project

This project focuses on detecting prompt injection attacks in large language models (LLMs) using machine learning (ML). With the widespread adoption of LLMs, many systems lack robust input validation, making them vulnerable to attacks that could compromise the Confidentiality, Integrity, and Availability (CIA) of sensitive data. ML is effective for this task as it can identify subtle manipulation attempts by learning from data patterns, rather than relying on predefined rules. Prompt injection, or prompt hacking, manipulates the model's behavior by bypassing safety constraints. By applying ML, my goal is to build a system capable of detecting these complex attack patterns and improving the security of AI-driven systems.

Key Components

The idea of the system presented here is designed to detect prompt injection attacks by analyzing input prompts using a combination of feature extraction, classification, and detailed analysis. The main components of this system will be:

Feature Extraction: A FeatureExtractor class extracts relevant features from the input prompt1
Classification: Two classifier options will in principal be provided - RandomForestClassifier and DistilBERTClassifier. These classifiers predict a risk score and classification (malicious or benign) based on the extracted features
Prompt Analysis: A PromptAnalyzer class performs detailed analysis of prompt characteristics, including length, special character ratio, keyword density, repetition score, structure complexity, risk factors, and pattern matches1

The system aims to identify potential injection attacks by examining various aspects of the prompt, such as command patterns, manipulation attempts, and obfuscation techniques, providing a multi-faceted approach to prompt security.

Dataset Generation (EnhancedDatasetGenerator)

A synthetic training dataset will be generated with both benign and malicious prompts:

Benign Prompts

Uses templates for common questions
Incorporates various topics (technical, general, business)
Adds natural context and complexity

Malicious Prompts

Implements various injection patterns:
- Role manipulation
- System command injection
- Constraint bypass attempts
Adds realistic context
Includes obfuscation techniques

Performance Metrics

Common performance metrics will be used to evaluate the system:

Binary classification metrics (precision, recall, F1)
ROC-AUC score
Confusion matrix
False positive/negative rates

Future Improvements

Model Enhancements
- Optimize architecture and explore other ML method alternatives
Dataset Improvements
- Generate dataset
- Merge the generated dataset with existing datasets from sources like HuggingFace, Kaggle, etc.
Interface for testing if a prompt is benign or malicious

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.devcontainer		.devcontainer
datasets		datasets
frontend		frontend
src		src
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
main.ipynb		main.ipynb
main.py		main.py
original_README.md		original_README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ML-based Prompt Injection Detector

Overview and Idea for project

Key Components

Dataset Generation (EnhancedDatasetGenerator)

Benign Prompts

Malicious Prompts

Performance Metrics

Future Improvements

About

Releases

Packages

Languages

License

detiuaveiro/aas-malware-icbaptista

Folders and files

Latest commit

History

Repository files navigation

ML-based Prompt Injection Detector

Overview and Idea for project

Key Components

Dataset Generation (EnhancedDatasetGenerator)

Benign Prompts

Malicious Prompts

Performance Metrics

Future Improvements

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages