Deception-RepE

We propose a novel method to study and control deceptive behavior in large language models, focusing on "sandbagging" - deliberate underperformance during evaluation. Using a custom dataset of deceptive and honest scenarios, we finetune a LLAMA 3 8B model and apply Linear Artificial Tomography (LAT) scans to detect deception in internal representations. Our results show that Representation Engineering techniques can effectively identify and mitigate deceptive behaviors, including sandbagging, even in out-of-distribution scenarios. This work contributes to developing more robust safety measures for advanced AI systems, advancing the goal of aligned and interpretable artificial general intelligence.

The dataset for Zero shot is linked here: https://huggingface.co/datasets/Avyay10/DeceptionQAnew

The dataset used for Sandbagging is linked here: https://huggingface.co/datasets/Avyay10/DeceptionLabelsFinal

The finetuned model is linked here: https://huggingface.co/Avyay10/llama-3-finetuned-final

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Deception-RepE

Files

README.md

Latest commit

History

README.md

File metadata and controls

Deception-RepE