Skip to content

detiuaveiro/aas-malware-icbaptista

 
 

Repository files navigation

ML-based Prompt Injection Detector

Nota: O projeto não está completo devido a falta de tempo (trabalhadora-estudante) e vou melhorar na época de Recurso. O relatório também ainda se encontrava incompleto por isso decidi não o colocar neste repositório por enquanto.

A machine learning system for detecting and preventing prompt injection attacks against AI language models using transformer-based architecture.

Overview and Idea for project

This project focuses on detecting prompt injection attacks in large language models (LLMs) using machine learning (ML). With the widespread adoption of LLMs, many systems lack robust input validation, making them vulnerable to attacks that could compromise the Confidentiality, Integrity, and Availability (CIA) of sensitive data. ML is effective for this task as it can identify subtle manipulation attempts by learning from data patterns, rather than relying on predefined rules. Prompt injection, or prompt hacking, manipulates the model's behavior by bypassing safety constraints. By applying ML, my goal is to build a system capable of detecting these complex attack patterns and improving the security of AI-driven systems.

Key Components

The idea of the system presented here is designed to detect prompt injection attacks by analyzing input prompts using a combination of feature extraction, classification, and detailed analysis. The main components of this system will be:

  • Feature Extraction: A FeatureExtractor class extracts relevant features from the input prompt1
  • Classification: Two classifier options will in principal be provided - RandomForestClassifier and DistilBERTClassifier. These classifiers predict a risk score and classification (malicious or benign) based on the extracted features
  • Prompt Analysis: A PromptAnalyzer class performs detailed analysis of prompt characteristics, including length, special character ratio, keyword density, repetition score, structure complexity, risk factors, and pattern matches1

The system aims to identify potential injection attacks by examining various aspects of the prompt, such as command patterns, manipulation attempts, and obfuscation techniques, providing a multi-faceted approach to prompt security.

Dataset Generation (EnhancedDatasetGenerator)

A synthetic training dataset will be generated with both benign and malicious prompts:

Benign Prompts

  • Uses templates for common questions
  • Incorporates various topics (technical, general, business)
  • Adds natural context and complexity

Malicious Prompts

  • Implements various injection patterns:
    • Role manipulation
    • System command injection
    • Constraint bypass attempts
  • Adds realistic context
  • Includes obfuscation techniques

Performance Metrics

Common performance metrics will be used to evaluate the system:

  • Binary classification metrics (precision, recall, F1)
  • ROC-AUC score
  • Confusion matrix
  • False positive/negative rates

Future Improvements

  1. Model Enhancements

    • Optimize architecture and explore other ML method alternatives
  2. Dataset Improvements

    • Generate dataset
    • Merge the generated dataset with existing datasets from sources like HuggingFace, Kaggle, etc.
  3. Interface for testing if a prompt is benign or malicious

About

aas24-aas-malware-aas-malware-pe created by GitHub Classroom

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 45.9%
  • Jupyter Notebook 43.6%
  • TypeScript 9.3%
  • Dockerfile 1.2%