Deception-RepE

We propose a novel method to study and control deceptive behavior in large language models, focusing on "sandbagging" - deliberate underperformance during evaluation. Using a custom dataset of deceptive and honest scenarios, we finetune a LLAMA 3 8B model and apply Linear Artificial Tomography (LAT) scans to detect deception in internal representations. Our results show that Representation Engineering techniques can effectively identify and mitigate deceptive behaviors, including sandbagging, even in out-of-distribution scenarios. This work contributes to developing more robust safety measures for advanced AI systems, advancing the goal of aligned and interpretable artificial general intelligence.

The dataset for Zero shot is linked here: https://huggingface.co/datasets/Avyay10/DeceptionQAnew

The dataset used for Sandbagging is linked here: https://huggingface.co/datasets/Avyay10/DeceptionLabelsFinal

The finetuned model is linked here: https://huggingface.co/Avyay10/llama-3-finetuned-final

Name		Name	Last commit message	Last commit date
Latest commit History 111 Commits
Finetuning		Finetuning
Results		Results
bluffing		bluffing
data		data
notebooks		notebooks
sb		sb
README.md		README.md
classification.py		classification.py
config.yaml		config.yaml
deception_repe_full.ipynb		deception_repe_full.ipynb
rep_readers.py		rep_readers.py
requirements.txt		requirements.txt
sb_finetuned_llama_methods_results.ipynb		sb_finetuned_llama_methods_results.ipynb
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Deception-RepE

About

Releases

Packages

Contributors 3

Languages

kaushik-3009/Deception-RepE

Folders and files

Latest commit

History

Repository files navigation

Deception-RepE

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages