perplexity.py & perplexity_mlm.py #23

Open
fengyx8 opened this issue Dec 15, 2024 · 5 comments
@fengyx8

fengyx8 commented Dec 15, 2024

Hi, authors!

I tried to reproduce the perplexity evaluation results on WikiText-2 for the baseline models such as bert-base-uncased, INLP, and Sentence-Debias, but I failed.

For bert-base-uncased, I got a final perplexity of 2059538.375, which is significantly different from the result of 4.469 reported in your paper. My script is:

python experiments/perplexity.py --model "BertForMaskedLM" --model_name_or_path "bert-base-uncased" --output_dir "bias-bench/results/perplexity/bert-base-uncased" --bias_type "gender"  --max_length 512

For INLP and Sentence-Debias, I failed to get any output with the script:

python experiments/perplexity.py --model "SentenceDebiasBertForMaskedLM" --model_name_or_path "bert-base-uncased" --bias_direction "bias-bench/results/subspace/subspace_m-BertModel_c-bert-base-uncased_t-gender.pt" --bias_type "gender" --max_length 512 --output_dir "bias-bench/results/perplexity/SentenceDebias"

By the way, I noticed that batch_jobs/perplexity.sh references two distinct files for evaluating perplexity: experiments/perplexity.py and experiments/perplexity_mlm.py. Perhaps the former is used for evaluating the GPT-based models, while the latter is for the BERT-based models.

However, there is no experiments/perplexity_mlm.py in this repository. Could you share it with us?

Thank you for open-sourcing your code!

@ncmeade
Collaborator

ncmeade commented Dec 16, 2024

Hi @fengyx8,

Thanks for raising the issue!

You're right: experiments/perplexity.py is for causal LMs. This is why you obtain odd perplexities when using that script to evaluate BERT.

For MLMs, we computed pseudo-perplexities. I believe I intended to clean up the pseudo-perplexity evaluation script before adding it to the repository, but forgot!

I've just added the rough experiment script we used (250539d). The code is brittle in some places, but it should give you an idea of the logic for evaluating perplexities of MLMs (see lines 668--708). Note that I didn't implement batched evaluation, so the code will throw an error if you try to run it with a batch size greater than one.
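
For orientation, here is a minimal sketch of the pseudo-perplexity idea (Salazar et al., 2020): mask each non-special token in turn, score the original token under the MLM, and exponentiate the average negative log-likelihood. This is illustrative only and is not the exact code from experiments/perplexity_mlm.py; the model name and example sentence are placeholders.

    import math
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    def pseudo_perplexity(text):
        input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
        nll, n_tokens = 0.0, 0
        # Mask one (non-special) token at a time and score the original token.
        for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            nll -= log_probs[input_ids[i]].item()
            n_tokens += 1
        # Pseudo-perplexity = exp(average negative log-likelihood).
        return math.exp(nll / n_tokens)

    print(pseudo_perplexity("The quick brown fox jumps over the lazy dog."))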

Hope this helps! Happy to answer any more questions.

@fengyx8
Author

fengyx8 commented Dec 19, 2024

Thank you for your prompt reply and for updating the code!
After computing the pseudo-perplexity of bert-base-uncased, INLP, and Sentence-Debias, I observed that my results (8.39, 12.43, and 8.42, respectively) differ from those reported in the paper (4.469, 6.152, and 4.483, as listed in Table 5).
My scripts are:

# bert-base-uncased
python experiments/perplexity_mlm.py --model "BertForMaskedLM" --model_name_or_path "bert-base-uncased" --output_dir "results/perplexity/bert-base-uncased" --bias_type "gender" --dataset_name "wikitext" --dataset_config_name "wikitext-2-raw-v1" --per_device_train_batch_size 1 --per_device_eval_batch_size 1

# INLP
python experiments/inlp_projection_matrix.py --model "BertModel" --model_name_or_path "bert-base-uncased" --bias_type "gender"
python experiments/perplexity_mlm.py --model "INLPBertForMaskedLM" --model_name_or_path "bert-base-uncased" --projection_matrix "results/projection_matrix/projection_m-BertModel_c-bert-base-uncased_t-gender_s-0.pt" --output_dir "results/perplexity/INLP" --bias_type "gender" --dataset_name "wikitext" --dataset_config_name "wikitext-2-raw-v1" --per_device_train_batch_size 1 --per_device_eval_batch_size 1

# Sentence-Debias
python experiments/sentence_debias_subspace.py --model "BertModel" --model_name_or_path "bert-base-uncased" --bias_type "gender"
python experiments/perplexity_mlm.py --model "SentenceDebiasBertForMaskedLM" --model_name_or_path "bert-base-uncased" --bias_direction "results/subspace/subspace_m-BertModel_c-bert-base-uncased_t-gender.pt" --output_dir "results/perplexity/SentenceDebiasBert" --bias_type "gender" --dataset_name "wikitext" --dataset_config_name "wikitext-2-raw-v1" --per_device_train_batch_size 1 --per_device_eval_batch_size 1

Could you advise on how I might accurately reproduce the findings presented in the publication?

Thanks again! :)

@ncmeade
Collaborator

ncmeade commented Dec 20, 2024

Hi @fengyx8 ,

Just to confirm first: are you using the package versions specified in setup.py? Also, are you using the WikiText data provided in the README, or the version from Hugging Face datasets?

@fengyx8
Author

fengyx8 commented Dec 20, 2024

Hi @ncmeade!

The package versions I am using are consistent with those specified in setup.py:

        "torch==1.10.2",
        "transformers==4.16.2",
        "scipy==1.7.3",
        "scikit-learn==1.0.2",
        "nltk==3.7.0",
        "datasets==1.18.3",
        "accelerate==0.5.1",

I used the Wikipedia-2.5 dataset, downloaded from the provided URL, to train both the INLP and SentenceDebias models according to the instructions in README.md.

For computing the pseudo-perplexity, I loaded the WikiText-2 dataset directly from Hugging Face.
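
Concretely, a minimal sketch of how I load it (assuming datasets==1.18.3; the split and field access are only for illustration):

    from datasets import load_dataset

    # WikiText-2 (raw) pulled directly from the Hugging Face hub.
    wikitext = load_dataset("wikitext", "wikitext-2-raw-v1")
    print(wikitext["test"][0]["text"])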

@ncmeade
Collaborator

ncmeade commented Dec 29, 2024

Hi @fengyx8, sorry for the delayed response!

Could you try with the Wikipedia-10 dataset included in the README? This should be the 10% of Wikipedia we used in the paper.

The data used for computing the perplexities should have a substantial impact on the numbers. However, I see that your results still give the same relative ordering of BERT, INLP, and SentenceDebias.
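
If it helps, here is a minimal sketch of loading a local plain-text dump such as Wikipedia-10 with the generic "text" loader of the datasets library; the file path below is a placeholder, not necessarily the exact name from the README:

    from datasets import load_dataset

    # Load a local plain-text file (e.g. the downloaded Wikipedia-10 dump).
    wiki10 = load_dataset("text", data_files={"train": "data/text/wikipedia-10.txt"})
    print(wiki10["train"][0]["text"])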
