Skip to content

kaushik-3009/Deception-RepE

Repository files navigation

Deception-RepE

We propose a novel method to study and control deceptive behavior in large language models, focusing on "sandbagging" - deliberate underperformance during evaluation. Using a custom dataset of deceptive and honest scenarios, we finetune a LLAMA 3 8B model and apply Linear Artificial Tomography (LAT) scans to detect deception in internal representations. Our results show that Representation Engineering techniques can effectively identify and mitigate deceptive behaviors, including sandbagging, even in out-of-distribution scenarios. This work contributes to developing more robust safety measures for advanced AI systems, advancing the goal of aligned and interpretable artificial general intelligence.

The dataset for Zero shot is linked here: https://huggingface.co/datasets/Avyay10/DeceptionQAnew

The dataset used for Sandbagging is linked here: https://huggingface.co/datasets/Avyay10/DeceptionLabelsFinal

The finetuned model is linked here: https://huggingface.co/Avyay10/llama-3-finetuned-final

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •