EMGSD Hermes

This project explores bias mitigation in GPT2-EMGSD. It uses correlation analysis to identify stereotype-related features and manipulates the corresponding activations, which points to a possible alternative to traditional fine-tuning. It also demonstrates that bias can be induced in vanilla GPT2 through activation engineering.
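As a rough illustration of what activation manipulation can mean, the sketch below adds a scaled unit direction to a hidden-state vector. This is a generic form of activation steering, not necessarily what this repository's scripts do: the function names `steer` and `projection`, the toy vectors, and the coefficients are all assumptions made for the example.

```python
from math import sqrt


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def projection(hidden, direction):
    """Length of `hidden` along the unit-normalised `direction`."""
    return dot(hidden, direction) / sqrt(dot(direction, direction))


def steer(hidden, direction, coefficient):
    """Return hidden + coefficient * unit(direction).

    Hypothetical helper: a negative coefficient pushes the activation away
    from the direction, a positive one pushes it towards the direction.
    """
    norm = sqrt(dot(direction, direction))
    return [h + coefficient * d / norm for h, d in zip(hidden, direction)]


# Toy values standing in for a model activation and a feature direction.
hidden = [1.0, 2.0, 0.0]
direction = [0.0, 3.0, 4.0]

suppressed = steer(hidden, direction, -2.0)
induced = steer(hidden, direction, 2.0)

print(projection(hidden, direction))      # component before steering
print(projection(suppressed, direction))  # smaller after suppression
print(projection(induced, direction))     # larger after induction
```

In this toy setup, steering changes the component along the direction by exactly the coefficient and leaves the orthogonal components untouched.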

Fast Demo

# Install Python 3.10, which is required by SAE-Lens
git clone https://github.com/seonglae/emgsd-hermes && cd emgsd-hermes
pip install torch colorama sae-lens transformers
python compare.py

Main Pipeline

TBA

1. Fine-tuning SAE with EMGSD dataset

python empsd.py

2. Extract features using correlation

python search_category.py
python search_stereo.py
# replace emgsd/*.json files
python draw_corr.py
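To make the idea of correlation-based feature extraction concrete, here is a minimal sketch that ranks features by the Pearson correlation between their activations and a binary stereotype label. The data layout (one row per text, one column per feature), the function names, and the toy numbers are assumptions for illustration; the actual scripts may compute this differently.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(activations, labels, top_k=2):
    """Return (feature_index, correlation) pairs sorted by |correlation|.

    `activations` is a hypothetical matrix: one row per text,
    one column per SAE feature.
    """
    n_features = len(activations[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in activations]
        scores.append((j, pearson(column, labels)))
    scores.sort(key=lambda item: abs(item[1]), reverse=True)
    return scores[:top_k]


# Toy data: 4 texts, 3 features; label 1 marks a stereotyped text.
activations = [
    [0.9, 0.1, 0.5],
    [0.8, 0.0, 0.5],
    [0.1, 0.7, 0.5],
    [0.0, 0.9, 0.5],
]
labels = [1, 1, 0, 0]

top = rank_features(activations, labels)
for index, score in top:
    print(f"feature {index}: r = {score:+.3f}")
```

Sorting by absolute value keeps both positively and negatively correlated features, since either kind may be useful for steering.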

Alternatively, to calculate mutual information instead:

python mi_stereo.py
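For reference, mutual information between a binarised feature activation and a stereotype label can be sketched as follows. The threshold, the `binarize` helper, and the toy values are assumptions for this example and are not taken from `mi_stereo.py`.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi


def binarize(values, threshold=0.5):
    """Hypothetical discretisation: 1 if the feature fires above threshold."""
    return [1 if v > threshold else 0 for v in values]


labels = [1, 1, 0, 0]            # 1 marks a stereotyped text
informative = [0.9, 0.8, 0.1, 0.0]
uninformative = [0.2, 0.2, 0.2, 0.2]

mi_informative = mutual_information(binarize(informative), labels)
mi_uninformative = mutual_information(binarize(uninformative), labels)
print(mi_informative, mi_uninformative)
```

Unlike Pearson correlation, mutual information also picks up non-linear dependence, at the cost of requiring a discretisation choice for continuous activations.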

3. Compute ratio of stereotyped text in generation

python compare_all.py
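The ratio itself is simple to state: the number of generations flagged as stereotyped divided by the total number of generations. The sketch below shows that computation with a stand-in classifier; `toy_classifier` and the tagged placeholder strings are hypothetical, and in practice the flag would presumably come from a trained stereotype classifier.

```python
def stereotype_ratio(texts, is_stereotyped):
    """Fraction of texts for which `is_stereotyped` returns True."""
    if not texts:
        return 0.0
    flagged = sum(1 for text in texts if is_stereotyped(text))
    return flagged / len(texts)


def toy_classifier(text):
    """Stand-in for a real classifier: flags texts tagged 'STEREO:'."""
    return text.startswith("STEREO:")


# Placeholder generations; the tags stand in for classifier decisions.
baseline_outputs = [
    "STEREO: sample a",
    "STEREO: sample b",
    "STEREO: sample c",
    "NEUTRAL: sample d",
]
steered_outputs = [
    "STEREO: sample a",
    "NEUTRAL: sample b",
    "NEUTRAL: sample c",
    "NEUTRAL: sample d",
]

baseline_ratio = stereotype_ratio(baseline_outputs, toy_classifier)
steered_ratio = stereotype_ratio(steered_outputs, toy_classifier)
print(f"baseline: {baseline_ratio:.2f}, steered: {steered_ratio:.2f}")
```

Comparing the two ratios on the same prompts gives a rough measure of how much the intervention changes the share of stereotyped generations.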

Loss Graph of fine-tuning SAE
