Generating Fig. 1 also produces Fig. 8 and Table 2:
python src/Fig1_make_pca_scatterplot.py
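The script name suggests a 2-D PCA projection of embeddings rendered as a scatterplot. As a rough, minimal sketch of that kind of figure (the `embeddings` array and the output filename are placeholders, not the script's actual interface):

```python
# Minimal sketch of a PCA scatterplot. `embeddings` is placeholder
# data standing in for an (n_samples, dim) array of contextualized embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # placeholder data

proj = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(proj[:, 0], proj[:, 1], s=2)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.savefig("fig1_pca_scatterplot.png", dpi=300)
```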
🚨 There was a typo in one of the definitions in the paper; the correct form and the typo are shown below.
Correct:
Typo (in the paper):
python src/Appendix_Fig9_make_sentences_histogram.py
python src/Appendix_Fig10_make_tokens_histogram.py
python src/Appendix_Fig11to13_make_QXt_MXt_VXt_VXtonMXt_scatterplot.py
python src/Appendix_Fig14to16_make_QXt_MXt_VXt_histogram.py
python src/Appendix_Fig17_make_Slope_of_QXt_MXt_VXt_plot.py
python src/Appendix_Fig18_make_MX_VX_VwX_VbX_per_QX_plot.py
python src/Appendix_Fig19_make_QX_MX_VX_plot.py
python src/Appendix_Fig20_make_VwX_VbX_VbXperVX_plot.py
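As an illustration of what the histogram scripts above compute, here is a minimal sketch in the spirit of `Appendix_Fig10_make_tokens_histogram.py`, counting tokens per sentence with the `bert-base-uncased` tokenizer (the sentence list is a placeholder; the real input is the BookCorpus subset):

```python
# Sketch of a per-sentence token-count histogram.
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = ["a short placeholder sentence .",
             "another , slightly longer placeholder sentence ."]

counts = [len(tokenizer.tokenize(s)) for s in sentences]
plt.hist(counts, bins=range(0, 65))  # "lt64" in the dataset name suggests <64 tokens
plt.xlabel("tokens per sentence")
plt.ylabel("frequency")
plt.savefig("fig10_tokens_histogram.png", dpi=300)
```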
Calculate statistical measures of word embeddings:
python src/Appendix_J_save_word_stats.py
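A minimal sketch of the kind of per-word statistics such a script might pickle. The grouping `word2embs` is a placeholder, and the Q/M/V definitions (quadratic mean of norms, norm of the mean, variance) are assumptions inferred from the script names; with these definitions, Q² = M² + V holds:

```python
# Sketch: per-word embedding statistics saved as a pickle.
# Assumed definitions: Q = quadratic mean of embedding norms,
# M = norm of the mean embedding, V = mean squared deviation.
import os
import pickle
import numpy as np

word2embs = {"example": np.random.default_rng(0).normal(size=(10, 768))}  # placeholder

stats = {}
for word, embs in word2embs.items():
    mean = embs.mean(axis=0)
    stats[word] = {
        "n": len(embs),
        "Q": np.sqrt((embs ** 2).sum(axis=1).mean()),
        "M": np.linalg.norm(mean),
        "V": ((embs - mean) ** 2).sum(axis=1).mean(),
    }

out_dir = "output/word_stats/bookcorpus_train_lt64_pct001_seed0"
os.makedirs(out_dir, exist_ok=True)
with open(f"{out_dir}/bert-base-uncased.pkl", "wb") as f:
    pickle.dump(stats, f)
```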
Generate the word-to-token count dictionary:
python src/Appendix_J_save_word2token_count.py
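One plausible reading, consistent with the 1/2/3/4+ grouping used for Fig. 22 below, is a mapping from each word to the number of WordPiece tokens it splits into. A minimal sketch with the `bert-base-uncased` tokenizer (the word list is a placeholder; the output name is copied from the tree below):

```python
# Sketch: word -> number of WordPiece tokens it splits into.
import os
import pickle
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
words = ["the", "tokenization", "bookcorpus"]  # placeholder vocabulary

word2token_count = {w: len(tokenizer.tokenize(w)) for w in words}

out_dir = "output/word_stats/bookcorpus_train_lt64_pct001_seed0"
os.makedirs(out_dir, exist_ok=True)
with open(f"{out_dir}/bookcorpus_bert-base-uncased_word2token_count.pkl", "wb") as f:
    pickle.dump(word2token_count, f)
```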
Place the downloaded data in the following structure:
output/
└── word_stats
└── bookcorpus_train_lt64_pct001_seed0
├── bert-base-uncased.pkl
└── bookcorpus_bert-base-uncased_word2token_count.pkl
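A quick way to verify the files are in place, with paths copied from the tree above:

```python
# Sanity-check loader for the two pickles shown above.
import pickle

base = "output/word_stats/bookcorpus_train_lt64_pct001_seed0"
with open(f"{base}/bert-base-uncased.pkl", "rb") as f:
    word_stats = pickle.load(f)
with open(f"{base}/bookcorpus_bert-base-uncased_word2token_count.pkl", "rb") as f:
    word2token_count = pickle.load(f)

print(len(word_stats), "words with stats;", len(word2token_count), "words with token counts")
```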
Plot Fig. 22:
python src/Appendix_Fig22_make_QXw_MXw_VXw_VXwonMXw_scatterplot.py
🚨 A bug was fixed: the dots are now plotted in the order 1, 2, 3, 4+. As a result, the figure differs slightly from the one in the paper.
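The draw order matters because later `scatter` calls paint over earlier ones, so plotting the groups in ascending token-count order keeps the small 4+ group visible on top. A minimal sketch of that fix (data, sizes, and labels are placeholders):

```python
# Sketch: draw scatter groups in a fixed ascending order (1, 2, 3, 4+)
# so the last-drawn group is not buried under the others.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = {label: rng.normal(0, 1, (n, 2))
          for label, n in [("1", 200), ("2", 100), ("3", 50), ("4+", 25)]}

for label in ["1", "2", "3", "4+"]:  # fixed draw order
    pts = groups[label]
    plt.scatter(pts[:, 0], pts[:, 1], s=4, label=label)
plt.legend(title="tokens per word")
plt.savefig("fig22_scatter_order.png", dpi=300)
```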