This project explores bias mitigation in GPT2-EMGSD, using correlation analysis for stereotype deduction and activation manipulation. The results highlight the potential of this approach as an alternative to traditional fine-tuning. The project also demonstrates that bias can be induced in vanilla GPT2 through activation engineering.
# Install Python 3.10, which is required by SAE-Lens
git clone https://github.com/seonglae/emgsd-hermes && cd emgsd-hermes
pip install torch colorama sae-lens transformers
python compare.py
TBA
python empsd.py
python search_category.py
python search_stereo.py
# replace emgsd/*.json files
python draw_corr.py
Or, to calculate mutual information instead:
python mi_stereo.py
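For readers unfamiliar with the metric, the following is a minimal sketch of mutual information between a binarized feature activation and a stereotype label. It is not the implementation in `mi_stereo.py`. The function name `mutual_information`, the idea of thresholding activations into 0/1, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences.

    Hypothetical helper for illustration; not taken from this repository.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy data (assumed): 1 means a feature fired above some threshold on a sentence.
fired = [1, 1, 0, 0]
labels = ["stereotype", "stereotype", "neutral", "neutral"]
print(mutual_information(fired, labels))         # feature perfectly predicts the label
print(mutual_information([1, 0, 1, 0], labels))  # feature is independent of the label
```

Unlike a linear correlation coefficient, mutual information also captures non-linear dependence, at the cost of requiring the activations to be discretized first.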
python compare_all.py
![image](https://private-user-images.githubusercontent.com/27716524/389290025-20ba51ae-7f58-4f11-af5c-5a9eaa2cd0da.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkwMTQwMjYsIm5iZiI6MTczOTAxMzcyNiwicGF0aCI6Ii8yNzcxNjUyNC8zODkyOTAwMjUtMjBiYTUxYWUtN2Y1OC00ZjExLWFmNWMtNWE5ZWFhMmNkMGRhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA4VDExMjIwNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTM1ZDE0MzliNTkwMDhmOGQxMTg5NDc0M2YwNjk3MGE2OWFjMjgxMDg2MWNiMjkxODcwYTI4MWE3MGYwZDUyOTkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.AJSztKBiFnU3OeJaOvlyD_m5HcjEcRL8ZFaenTFMIj0)