README.Appendix.md

Appendix

Code

Fig. 8 and Table 2 for PCA-transformed embeddings in Fig. 1

The script that generates Fig. 1 also produces Fig. 8 and Table 2.

python src/Fig1_make_pca_scatterplot.py

🚨 There was a typo in the definition of $N_r$ in Appendix A:

Correct: $N_r := 2 + \left\lfloor 4\sqrt{\frac{|T_r|}{\max_r{|T_r|}}} \right\rfloor$

Typo (in the paper): $N_r := 2 + \left\lfloor \sqrt{\frac{4|T_r|}{\max_r{|T_r|}}} \right\rfloor$
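
To make the difference between the two definitions concrete, here is a minimal sketch that evaluates both. The function names and the set sizes $|T_r|$ are hypothetical and chosen only for illustration; they are not taken from the repository code. Because the ratio $|T_r| / \max_r |T_r|$ lies in $(0, 1]$, the corrected definition yields values from 2 to 6, whereas the printed one yields values from 2 to 4.

```python
import math


def n_r_correct(size: int, max_size: int) -> int:
    """Corrected definition: N_r := 2 + floor(4 * sqrt(|T_r| / max_r |T_r|))."""
    return 2 + math.floor(4 * math.sqrt(size / max_size))


def n_r_typo(size: int, max_size: int) -> int:
    """Definition as printed in the paper: N_r := 2 + floor(sqrt(4 * |T_r| / max_r |T_r|))."""
    return 2 + math.floor(math.sqrt(4 * size / max_size))


# Hypothetical set sizes |T_r|, for illustration only.
sizes = [100, 25, 1]
max_size = max(sizes)

for size in sizes:
    print(size, n_r_correct(size, max_size), n_r_typo(size, max_size))
```

For the largest set (ratio 1), the corrected formula gives 6 and the printed one gives 4.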

Sentence length histogram in Fig. 9

python src/Appendix_Fig9_make_sentences_histogram.py
fig9.png

Histogram of $\log_{10} n_t$ in Fig. 10

python src/Appendix_Fig10_make_tokens_histogram.py 
fig10.png

Scatter plots in Figs. 11, 12, and 13

python src/Appendix_Fig11to13_make_QXt_MXt_VXt_VXtonMXt_scatterplot.py
fig11a.png

Histograms in Figs. 14, 15, and 16

python src/Appendix_Fig14to16_make_QXt_MXt_VXt_histogram.py
fig14a.png

Slope plots in Fig. 17

python src/Appendix_Fig17_make_Slope_of_QXt_MXt_VXt_plot.py
fig17.png

Plots of $M(X)$, $V(X)$, $V_W(X)$, and $V_B(X)$ normalized by $Q(X)$ in Fig. 18

python src/Appendix_Fig18_make_MX_VX_VwX_VbX_per_QX_plot.py 
fig18.png

Plots of $Q(X)$, $M(X)$, and $V(X)$ in Fig. 19

python src/Appendix_Fig19_make_QX_MX_VX_plot.py
fig19.png

Plots of $V_W(X)$, $V_B(X)$, and $V_B(X)/V(X)$ in Fig. 20

python src/Appendix_Fig20_make_VwX_VbX_VbXperVX_plot.py
fig20.png
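
The figure scripts listed above can also be run in one pass. The following is a minimal sketch of such a driver, not part of the repository: `SCRIPTS`, `build_commands`, and `run_all` are hypothetical names, and it assumes the scripts are run from the repository root, take no arguments, and do not depend on one another.

```python
import subprocess
import sys

# Script paths copied from the commands listed above.
SCRIPTS = [
    "src/Fig1_make_pca_scatterplot.py",
    "src/Appendix_Fig9_make_sentences_histogram.py",
    "src/Appendix_Fig10_make_tokens_histogram.py",
    "src/Appendix_Fig11to13_make_QXt_MXt_VXt_VXtonMXt_scatterplot.py",
    "src/Appendix_Fig14to16_make_QXt_MXt_VXt_histogram.py",
    "src/Appendix_Fig17_make_Slope_of_QXt_MXt_VXt_plot.py",
    "src/Appendix_Fig18_make_MX_VX_VwX_VbX_per_QX_plot.py",
    "src/Appendix_Fig19_make_QX_MX_VX_plot.py",
    "src/Appendix_Fig20_make_VwX_VbX_VbXperVX_plot.py",
]


def build_commands(python: str = sys.executable) -> list[list[str]]:
    """Return one argument list per script, using the given interpreter."""
    return [[python, script] for script in SCRIPTS]


def run_all() -> None:
    """Run every script in order; check=True stops at the first failure."""
    for cmd in build_commands():
        subprocess.run(cmd, check=True)


# Call run_all() from the repository root to execute the scripts.
```

Using `check=True` makes a failing script raise `subprocess.CalledProcessError` instead of silently continuing to the next figure.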

Appendix J

Setup (For Reproducibility)

Calculate statistical measures of word embeddings:

python src/Appendix_J_save_word_stats.py

Generate the word-to-token count dictionary:

python src/Appendix_J_save_word2token_count.py

Setup (Download Experimental Results)

Place the downloaded data in the following structure:

output/
└── word_stats
    └── bookcorpus_train_lt64_pct001_seed0
        ├── bert-base-uncased.pkl
        └── bookcorpus_bert-base-uncased_word2token_count.pkl
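
Before plotting, it can help to confirm that the files are where the tree above says they should be. This is a minimal sketch under the assumption that the paths are relative to the repository root; `EXPECTED_FILES` and `missing_files` are hypothetical names, not part of the repository.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the repository root.
EXPECTED_FILES = [
    "output/word_stats/bookcorpus_train_lt64_pct001_seed0/bert-base-uncased.pkl",
    "output/word_stats/bookcorpus_train_lt64_pct001_seed0/"
    "bookcorpus_bert-base-uncased_word2token_count.pkl",
]


def missing_files(root: str = ".") -> list[str]:
    """Return the expected files that are not present under `root`."""
    return [rel for rel in EXPECTED_FILES if not (Path(root) / rel).is_file()]


missing = missing_files()
if missing:
    print("Missing files:", *missing, sep="\n  ")
else:
    print("All expected files are in place.")
```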

Plot Fig. 22:

python src/Appendix_Fig22_make_QXw_MXw_VXw_VXwonMXw_scatterplot.py 
fig22.png

🚨 A bug was fixed: the dots are now plotted in the order 1, 2, 3, and 4+. As a result, the figure differs slightly from the one in the paper.