perplexity.py & perplexity_mlm.py #23

Open
fengyx8 opened this issue Dec 15, 2024 · 5 comments
@fengyx8

fengyx8 commented Dec 15, 2024

Hi, authors!

I tried to reproduce the perplexity evaluation results on WikiText-2 for the baseline models such as bert-base-uncased, INLP, and Sentence-Debias, but I failed.

For bert-base-uncased, I got a final perplexity of 2059538.375, which is significantly different from the result of 4.469 reported in your paper. My script is:

python experiments/perplexity.py --model "BertForMaskedLM" --model_name_or_path "bert-base-uncased" --output_dir "bias-bench/results/perplexity/bert-base-uncased" --bias_type "gender"  --max_length 512

For INLP and Sentence-Debias, I failed to get any output with the script:

python experiments/perplexity.py --model "SentenceDebiasBertForMaskedLM" --model_name_or_path "bert-base-uncased" --bias_direction "bias-bench/results/subspace/subspace_m-BertModel_c-bert-base-uncased_t-gender.pt" --bias_type "gender" --max_length 512 --output_dir "bias-bench/results/perplexity/SentenceDebias"

By the way, I noticed that batch_jobs/perplexity.sh references two distinct files for evaluating perplexity: experiments/perplexity.py and experiments/perplexity_mlm.py. Perhaps the former is used for evaluating the GPT-based models, while the latter is for the BERT-based models.

However, there is no experiments/perplexity_mlm.py in this repository. Could you share it with us?

Thank you for open-sourcing your code!

@ncmeade
Collaborator

ncmeade commented Dec 16, 2024

Hi @fengyx8,

Thanks for raising the issue!

You're right: experiments/perplexity.py is for causal LMs. This is why you obtain odd perplexities when using that script to evaluate BERT.

For MLMs, we computed pseudo-perplexities. I believe I intended to clean up the pseudo-perplexity evaluation script before adding it to the repository, but forgot!

I've just added the rough experiment script we used (250539d). The code is brittle in some places, but it should give you an idea of the logic for evaluating perplexities of MLMs (see lines 668--708). Note that I didn't implement batched evaluation, so the code will throw an error if you try to run it with a batch size greater than one.
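
For orientation, here is a minimal sketch of the pseudo-perplexity idea (Salazar et al., 2020): mask each non-special token in turn, score the original token under the MLM, and exponentiate the average negative log-likelihood. This is illustrative only and is not the exact code from experiments/perplexity_mlm.py; the model name and example sentence are placeholders.

    import math
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    def pseudo_perplexity(text):
        input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
        nll, n_tokens = 0.0, 0
        # Mask one (non-special) token at a time and score the original token.
        for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            nll -= log_probs[input_ids[i]].item()
            n_tokens += 1
        # Pseudo-perplexity = exp(average negative log-likelihood).
        return math.exp(nll / n_tokens)

    print(pseudo_perplexity("The quick brown fox jumps over the lazy dog."))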

Hope this helps! Happy to answer any more questions.

@fengyx8
Author

fengyx8 commented Dec 19, 2024

Thank you for your prompt reply and for updating the code!
After computing the pseudo-perplexity of bert-base-uncased, INLP, and Sentence-Debias, I observed that my results (8.39, 12.43, and 8.42, respectively) differ from those reported in the paper (4.469, 6.152, and 4.483, as listed in Table 5).
My scripts are:

# bert-base-uncased
python experiments/perplexity_mlm.py --model "BertForMaskedLM" --model_name_or_path "bert-base-uncased" --output_dir "results/perplexity/bert-base-uncased" --bias_type "gender" --dataset_name "wikitext" --dataset_config_name "wikitext-2-raw-v1" --per_device_train_batch_size 1 --per_device_eval_batch_size 1

# INLP
python experiments/inlp_projection_matrix.py --model "BertModel" --model_name_or_path "bert-base-uncased" --bias_type "gender"
python experiments/perplexity_mlm.py --model "INLPBertForMaskedLM" --model_name_or_path "bert-base-uncased" --projection_matrix "results/projection_matrix/projection_m-BertModel_c-bert-base-uncased_t-gender_s-0.pt" --output_dir "results/perplexity/INLP" --bias_type "gender" --dataset_name "wikitext" --dataset_config_name "wikitext-2-raw-v1" --per_device_train_batch_size 1 --per_device_eval_batch_size 1

# Sentence-Debias
python experiments/sentence_debias_subspace.py --model "BertModel" --model_name_or_path "bert-base-uncased" --bias_type "gender"
python experiments/perplexity_mlm.py --model "SentenceDebiasBertForMaskedLM" --model_name_or_path "bert-base-uncased" --bias_direction "results/subspace/subspace_m-BertModel_c-bert-base-uncased_t-gender.pt" --output_dir "results/perplexity/SentenceDebiasBert" --bias_type "gender" --dataset_name "wikitext" --dataset_config_name "wikitext-2-raw-v1" --per_device_train_batch_size 1 --per_device_eval_batch_size 1

Could you advise on how I might accurately reproduce the findings presented in the publication?

Thanks again! :)

@ncmeade
Collaborator

ncmeade commented Dec 20, 2024

Hi @fengyx8 ,

Just to confirm first: are you using the package versions specified in setup.py? Also, are you using the WikiText data provided in the README, or the version from Hugging Face datasets?

@fengyx8
Author

fengyx8 commented Dec 20, 2024

Hi @ncmeade!

The package versions I am using are consistent with those specified in setup.py:

        "torch==1.10.2",
        "transformers==4.16.2",
        "scipy==1.7.3",
        "scikit-learn==1.0.2",
        "nltk==3.7.0",
        "datasets==1.18.3",
        "accelerate==0.5.1",

I used the Wikipedia-2.5 dataset, downloaded from the provided URL, to train both the INLP and SentenceDebias models according to the instructions in README.md.

For computing the pseudo-perplexity, I loaded the WikiText-2 dataset directly from Hugging Face.
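
Concretely, a minimal sketch of how I load it (assuming datasets==1.18.3; the split and field access are only for illustration):

    from datasets import load_dataset

    # WikiText-2 (raw) pulled directly from the Hugging Face hub.
    wikitext = load_dataset("wikitext", "wikitext-2-raw-v1")
    print(wikitext["test"][0]["text"])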

@ncmeade
Collaborator

ncmeade commented Dec 29, 2024

Hi @fengyx8, sorry for the delayed response!

Could you try with the Wikipedia-10 dataset included in the README? This should be the 10% of Wikipedia we used in the paper.

The data used for computing the perplexities should have a substantial impact on the numbers. However, I see that your results still give the same relative ordering of BERT, INLP, and SentenceDebias.
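
If it helps, here is a minimal sketch of loading a local plain-text dump such as Wikipedia-10 with the generic "text" loader of the datasets library; the file path below is a placeholder, not necessarily the exact name from the README:

    from datasets import load_dataset

    # Load a local plain-text file (e.g. the downloaded Wikipedia-10 dump).
    wiki10 = load_dataset("text", data_files={"train": "data/text/wikipedia-10.txt"})
    print(wiki10["train"][0]["text"])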
