Norm of Mean Contextualized Embeddings Determines their Variance
Hiroaki Yamagiwa, Hidetoshi Shimodaira
COLING 2025
This repository is intended to be run in a Docker environment. If you are not familiar with Docker, install the packages listed in requirements.txt instead.
Create a Docker image as follows:
$ bash script/docker/build.sh
Set the DOCKER_HOME environment variable to specify the path of the directory to be mounted as the home directory inside the Docker container:
$ export DOCKER_HOME="path/to/your/docker_home"
Run the Docker container by passing the GPU ID as an argument:
$ bash script/docker/run.sh 0
- Sentences extracted from bookcorpus (Google Drive)
- Statistical measures for each model (Google Drive)
Place the downloaded data in the following structure:
output/
├── datasets
│   └── bookcorpus_train_lt64_pct001_seed0.pkl
└── token_stats
    └── bookcorpus_train_lt64_pct001_seed0
        ├── bert-base-uncased.pkl
        ├── bert-large-uncased.pkl
        ├── gpt2-medium.pkl
        ├── gpt2.pkl
        ├── roberta-base.pkl
        └── roberta-large.pkl
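Before running the figure scripts, it may help to confirm that the downloaded files are where the tree above places them. The following is a minimal sketch and is not part of the repository: the helper names expected_paths and missing_paths are hypothetical, and the script assumes it is run from the repository root.

```python
from pathlib import Path

# File and directory names are taken from the directory tree above.
DATASET = "bookcorpus_train_lt64_pct001_seed0"
MODELS = [
    "bert-base-uncased",
    "bert-large-uncased",
    "gpt2-medium",
    "gpt2",
    "roberta-base",
    "roberta-large",
]


def expected_paths(root="output"):
    """Return every file listed in the tree, located under `root`."""
    root = Path(root)
    paths = [root / "datasets" / f"{DATASET}.pkl"]
    paths += [root / "token_stats" / DATASET / f"{m}.pkl" for m in MODELS]
    return paths


def missing_paths(root="output"):
    """Return the expected files that do not exist under `root`."""
    return [p for p in expected_paths(root) if not p.is_file()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing files:")
        for p in missing:
            print(f"  {p}")
    else:
        print("All expected files are in place.")
```

The check only looks at file names and locations; it does not open the pickle files or validate their contents.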
To regenerate statistical measures:
python src/save_token_stats.py --model_name model_name
The supported model_name values are bert-base-uncased, bert-large-uncased, roberta-base, roberta-large, gpt2, and gpt2-medium.
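To regenerate the statistics for every supported model in one go, the command above can be wrapped in a small loop. This is a sketch under stated assumptions rather than a script shipped with the repository: build_command and regenerate_all are hypothetical helper names, the script is assumed to be launched from the repository root, and only the --model_name flag shown above is used.

```python
import subprocess
import sys

# Model names as listed in this README.
SUPPORTED_MODELS = [
    "bert-base-uncased",
    "bert-large-uncased",
    "roberta-base",
    "roberta-large",
    "gpt2",
    "gpt2-medium",
]


def build_command(model_name):
    """Build the argument list for one run of src/save_token_stats.py."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported model_name: {model_name}")
    return [sys.executable, "src/save_token_stats.py", "--model_name", model_name]


def regenerate_all(dry_run=True):
    """Print (dry run) or execute the command for each supported model."""
    commands = [build_command(m) for m in SUPPORTED_MODELS]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop at the first failing model.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical default: only print the commands; pass dry_run=False to run them.
    regenerate_all(dry_run=True)
```

Defaulting to a dry run is a deliberate choice here, since each real run loads a model and may take a long time on a GPU.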
python src/Fig1_make_pca_scatterplot.py
This script also generates Fig. 8 and Table 2. See README.Appendix.md for more details.
python src/Fig2_make_VXt_on_MXt_scatterplot.py
🚨 Note: The color bar range in the published figure was incorrect: the color bar for BERT was shown, but the ranges were not unified across models. This issue has been fixed, and its impact is minimal.
python src/Fig4_make_QXtCV_MXtVXtSlope_MXtVXtR2_plot.py
python src/Fig5_make_MXVwXVbX_per_QX_bargraph.py
python src/Fig6_make_VwX_per_VX_plot.py
python src/Fig7_make_BERTbase_QXt_MXt_VXt_scatterplot.py
The code for generating embeddings was inspired by:
Wannasuphoprasit et al. Solving Cosine Similarity Underestimation between High Frequency Words by $\ell_2$ Norm Discounting. ACL 2023 Findings.
We sincerely thank the authors for sharing their LivNLP/cosine-discounting codebase.
See README.Appendix.md for the experiments in the Appendix.
- Since the URLs of the published datasets may change, please refer to the GitHub repository URL rather than the direct dataset URLs when citing them in papers or elsewhere.
- This directory was created by Hiroaki Yamagiwa.