Generating Fig. 1 also produces Fig. 8 and Table 2:
python src/Fig1_make_pca_scatterplot.py
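The script name suggests a 2-D PCA projection of embeddings rendered as a scatterplot. As a rough, minimal sketch of that kind of figure (the `embeddings` array and the output filename are placeholders, not the script's actual interface):

```python
# Minimal sketch of a PCA scatterplot. `embeddings` is placeholder
# data standing in for an (n_samples, dim) array of contextualized embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # placeholder data

proj = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(proj[:, 0], proj[:, 1], s=2)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.savefig("fig1_pca_scatterplot.png", dpi=300)
```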
🚨 There was a typo in one of the definitions in the paper; the correct form and the typo are shown below.
Correct:
Typo (in the paper):
python src/Appendix_Fig9_make_sentences_histogram.py
python src/Appendix_Fig10_make_tokens_histogram.py
python src/Appendix_Fig11to13_make_QXt_MXt_VXt_VXtonMXt_scatterplot.py
python src/Appendix_Fig14to16_make_QXt_MXt_VXt_histogram.py
python src/Appendix_Fig17_make_Slope_of_QXt_MXt_VXt_plot.py
python src/Appendix_Fig18_make_MX_VX_VwX_VbX_per_QX_plot.py
python src/Appendix_Fig19_make_QX_MX_VX_plot.py
python src/Appendix_Fig20_make_VwX_VbX_VbXperVX_plot.py
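As an illustration of what the histogram scripts above compute, here is a minimal sketch in the spirit of `Appendix_Fig10_make_tokens_histogram.py`, counting tokens per sentence with the `bert-base-uncased` tokenizer (the sentence list is a placeholder; the real input is the BookCorpus subset):

```python
# Sketch of a per-sentence token-count histogram.
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = ["a short placeholder sentence .",
             "another , slightly longer placeholder sentence ."]

counts = [len(tokenizer.tokenize(s)) for s in sentences]
plt.hist(counts, bins=range(0, 65))  # "lt64" in the dataset name suggests <64 tokens
plt.xlabel("tokens per sentence")
plt.ylabel("frequency")
plt.savefig("fig10_tokens_histogram.png", dpi=300)
```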
Calculate statistical measures of word embeddings:
python src/Appendix_J_save_word_stats.py
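A minimal sketch of the kind of per-word statistics such a script might pickle. The grouping `word2embs` is a placeholder, and the Q/M/V definitions (quadratic mean of norms, norm of the mean, variance) are assumptions inferred from the script names; with these definitions, Q² = M² + V holds:

```python
# Sketch: per-word embedding statistics saved as a pickle.
# Assumed definitions: Q = quadratic mean of embedding norms,
# M = norm of the mean embedding, V = mean squared deviation.
import os
import pickle
import numpy as np

word2embs = {"example": np.random.default_rng(0).normal(size=(10, 768))}  # placeholder

stats = {}
for word, embs in word2embs.items():
    mean = embs.mean(axis=0)
    stats[word] = {
        "n": len(embs),
        "Q": np.sqrt((embs ** 2).sum(axis=1).mean()),
        "M": np.linalg.norm(mean),
        "V": ((embs - mean) ** 2).sum(axis=1).mean(),
    }

out_dir = "output/word_stats/bookcorpus_train_lt64_pct001_seed0"
os.makedirs(out_dir, exist_ok=True)
with open(f"{out_dir}/bert-base-uncased.pkl", "wb") as f:
    pickle.dump(stats, f)
```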
Generate the word-to-token count dictionary:
python src/Appendix_J_save_word2token_count.py
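One plausible reading, consistent with the 1/2/3/4+ grouping used for Fig. 22 below, is a mapping from each word to the number of WordPiece tokens it splits into. A minimal sketch with the `bert-base-uncased` tokenizer (the word list is a placeholder; the output name is copied from the tree below):

```python
# Sketch: word -> number of WordPiece tokens it splits into.
import os
import pickle
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
words = ["the", "tokenization", "bookcorpus"]  # placeholder vocabulary

word2token_count = {w: len(tokenizer.tokenize(w)) for w in words}

out_dir = "output/word_stats/bookcorpus_train_lt64_pct001_seed0"
os.makedirs(out_dir, exist_ok=True)
with open(f"{out_dir}/bookcorpus_bert-base-uncased_word2token_count.pkl", "wb") as f:
    pickle.dump(word2token_count, f)
```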
Place the downloaded data in the following structure:
output/
└── word_stats
└── bookcorpus_train_lt64_pct001_seed0
├── bert-base-uncased.pkl
└── bookcorpus_bert-base-uncased_word2token_count.pkl
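A quick way to verify the files are in place, with paths copied from the tree above:

```python
# Sanity-check loader for the two pickles shown above.
import pickle

base = "output/word_stats/bookcorpus_train_lt64_pct001_seed0"
with open(f"{base}/bert-base-uncased.pkl", "rb") as f:
    word_stats = pickle.load(f)
with open(f"{base}/bookcorpus_bert-base-uncased_word2token_count.pkl", "rb") as f:
    word2token_count = pickle.load(f)

print(len(word_stats), "words with stats;", len(word2token_count), "words with token counts")
```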
Plot Fig. 22:
python src/Appendix_Fig22_make_QXw_MXw_VXw_VXwonMXw_scatterplot.py
🚨 A bug was fixed: the dots are now plotted in the order 1, 2, 3, 4+. As a result, the figure differs slightly from the one in the paper.
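The draw order matters because later `scatter` calls paint over earlier ones, so plotting the groups in ascending token-count order keeps the small 4+ group visible on top. A minimal sketch of that fix (data, sizes, and labels are placeholders):

```python
# Sketch: draw scatter groups in a fixed ascending order (1, 2, 3, 4+)
# so the last-drawn group is not buried under the others.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = {label: rng.normal(0, 1, (n, 2))
          for label, n in [("1", 200), ("2", 100), ("3", 50), ("4+", 25)]}

for label in ["1", "2", "3", "4+"]:  # fixed draw order
    pts = groups[label]
    plt.scatter(pts[:, 0], pts[:, 1], s=4, label=label)
plt.legend(title="tokens per word")
plt.savefig("fig22_scatter_order.png", dpi=300)
```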