Code for the RELATIONS 2019 Workshop paper
Semantic Matching of Documents from Heterogeneous Collections: A Simple and Transparent Method for Practical Applications arXiv
Create a new virtual environment with Python 3.6 first:
$ conda create --name topn python=3.6
$ conda activate topn
Clone this repository:
$ (topn) git clone https://github.com/nlpAThits/TopNCosSimAvg.git
Put the required files into the folders data, wombat-data, concept-project-mapping-dataset, and fastText (see the respective README.md files in these folders).
The code in this repository uses the WOMBAT-API, which can be installed as follows:
$ (topn) git clone https://github.com/nlpAThits/WOMBAT.git
$ (topn) cd WOMBAT
$ (topn) pip install .
Finally, install the following libraries:
$ (topn) conda install scipy scikit-learn gensim matplotlib colorama tqdm nltk==3.2.5
For the AVG_COS_SIM measure, tuning comprises a brute-force search for the optimal value for the sim_ts parameter (the minimum cosine similarity).
The start, end, and step values for sim_ts can be supplied like this: --sim_ts start:end:step, where start, end, and step must be floats.
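To make the start:end:step notation concrete, here is a minimal sketch of how such a specification could be expanded into the list of thresholds to test. The helper parse_float_range is hypothetical and is not taken from this repository. Whether the script treats the end value as inclusive is an assumption here. Decimal is used only to avoid accumulating floating-point error over many small steps.

```python
from decimal import Decimal


def parse_float_range(spec):
    """Expand 'start:end:step' into a list of floats.

    Hypothetical helper for illustration only. The end value is treated
    as inclusive, which is an assumption about the intended behaviour.
    A single value such as '.430' yields a one-element list.
    """
    parts = spec.split(":")
    if len(parts) == 1:
        return [float(parts[0])]
    if len(parts) != 3:
        raise ValueError("expected 'start:end:step' or a single value")
    start, end, step = (Decimal(p) for p in parts)
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    current = start
    while current <= end:
        values.append(float(current))
        current += step  # exact decimal arithmetic, no drift
    return values


thresholds = parse_float_range("0.3:1.0:0.005")
print(len(thresholds), thresholds[0], thresholds[-1])
```

Under these assumptions, the range 0.3:1.0:0.005 used in the calls in this README expands to 141 candidate thresholds.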
The following call will search the whole range for the 'label' input for all four unit types, where
types = -tf -idf
tokens = +tf -idf
idf_types = -tf +idf
idf_tokens = +tf +idf
--plot_curves yes causes result plots to be written to the ./plots/ folder.
$ (topn) python perform-c-p-matching.py --input label --embeddings google --measures avg_cos_sim \
    --sim_ts 0.3:1.0:0.005 --units types,tokens,idf_types,idf_tokens --mode dev --plot_curves yes
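The four unit types differ only in whether term frequency (tf) and inverse document frequency (idf) contribute to a word's weight. The sketch below illustrates one way this could work, assuming that AVG_COS_SIM compares the weighted average word vectors of two texts by cosine similarity, as its name suggests, and that a pair counts as a match when the similarity reaches sim_ts. All function names, the toy vectors, and the toy idf values are hypothetical; the repository's actual implementation may differ.

```python
import math
from collections import Counter

# (use_tf, use_idf) per unit type, following the table in this README.
UNIT_FLAGS = {
    "types": (False, False),
    "tokens": (True, False),
    "idf_types": (False, True),
    "idf_tokens": (True, True),
}


def average_vector(words, vectors, idf, unit):
    """Weighted average of the word vectors of one text (illustrative)."""
    use_tf, use_idf = UNIT_FLAGS[unit]
    counts = Counter(w for w in words if w in vectors)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in counts.items():
        # -tf: every distinct word counts once; +tf: once per occurrence.
        weight = float(count) if use_tf else 1.0
        if use_idf:
            weight *= idf.get(word, 1.0)
        for i, x in enumerate(vectors[word]):
            total[i] += weight * x
        weight_sum += weight
    return [x / weight_sum for x in total] if weight_sum else total


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def avg_cos_sim(words_a, words_b, vectors, idf, unit):
    return cosine(average_vector(words_a, vectors, idf, unit),
                  average_vector(words_b, vectors, idf, unit))


# Toy data, invented for illustration only.
vectors = {"solar": [0.8, 0.6], "cell": [0.0, 1.0], "energy": [1.0, 0.0]}
idf = {"solar": 2.0, "cell": 1.5, "energy": 0.5}

sim = avg_cos_sim(["solar", "solar", "cell"], ["solar", "energy"],
                  vectors, idf, "idf_tokens")
is_match = sim >= 0.43  # assumed decision rule: match if sim reaches sim_ts
print(round(sim, 3), is_match)
```

With "types", the repeated word "solar" contributes once; with "tokens", it contributes twice, which pulls the average vector towards it.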
For the TOP_N_COS_SIM_AVG measure, tuning comprises a brute-force search for the optimal value for the sim_ts parameter (the minimum cosine similarity, cf. above), plus a search for the optimal value for the top_n parameter.
The range of values for top_n to test can be supplied like this: --top_n start:end:step, where start, end, and step must be integers.
One row in the plot will be created for every value of top_n.
$ (topn) python perform-c-p-matching.py --input label --embeddings google \
    --measures top_n_cos_sim_avg --top_n 2:30:2 \
    --sim_ts 0.3:1.0:0.005 --units types,tokens,idf_types,idf_tokens --mode dev --plot_curves yes
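The joint search over top_n and sim_ts can be pictured with the following sketch. It assumes that TOP_N_COS_SIM_AVG averages the top_n highest pairwise cosine similarities between the words of two texts, which is what the name suggests, and it selects the best setting by accuracy on labelled pairs; this README does not state the score that the script actually optimizes. Unit weighting is left out for brevity, and all names and the toy data are hypothetical.

```python
import math
from itertools import product


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_cos_sim_avg(words_a, words_b, vectors, top_n):
    """Average of the top_n highest pairwise word similarities (illustrative)."""
    sims = sorted(
        (cosine(vectors[a], vectors[b])
         for a, b in product(words_a, words_b)
         if a in vectors and b in vectors),
        reverse=True,
    )
    top = sims[:top_n]
    return sum(top) / len(top) if top else 0.0


def grid_search(pairs, labels, vectors, top_n_values, sim_ts_values):
    """Brute-force search; returns (best_score, top_n, sim_ts).

    The selection score is plain accuracy, an assumption made for this sketch.
    """
    best = (-1.0, None, None)
    for top_n in top_n_values:
        # Similarities depend only on top_n, so compute them once per value.
        scores = [top_n_cos_sim_avg(a, b, vectors, top_n) for a, b in pairs]
        for sim_ts in sim_ts_values:
            predictions = [s >= sim_ts for s in scores]
            accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
            if accuracy > best[0]:
                best = (accuracy, top_n, sim_ts)
    return best


# Toy data, invented for illustration only.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
pairs = [(["a", "c"], ["a"]), (["a"], ["b"])]
labels = [True, False]
print(grid_search(pairs, labels, vectors, range(2, 31, 2), [0.3, 0.5, 0.7]))
```

Computing the similarities once per top_n value and then sweeping the thresholds keeps the brute-force search cheap, since changing sim_ts only changes the final comparison.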
The following call will reproduce the top avg_cos_sim result reached when only label information is used.
$ (topn) python perform-c-p-matching.py --mode dev --input label --sim_ts .430 --units idf_tokens \
    --embeddings google --measures avg_cos_sim --print_classifications yes
Since the top results for avg_cos_sim are all obtained with essentially the same settings, change only the values of --input and --sim_ts to reproduce the other top baseline results.
$ (topn) python perform-c-p-matching.py --mode dev --input description --sim_ts .530 --units idf_tokens \
    --embeddings google --measures avg_cos_sim --print_classifications yes
$ (topn) python perform-c-p-matching.py --mode dev --input both --sim_ts .545 --units idf_tokens \
    --embeddings google --measures avg_cos_sim --print_classifications yes
Likewise, the following calls will reproduce the three top top_n_cos_sim_avg results:
$ (topn) python perform-c-p-matching.py --mode dev --input label --sim_ts .345 --units tokens \
    --embeddings google --measures top_n_cos_sim_avg --top_n 22 --print_classifications yes
$ (topn) python perform-c-p-matching.py --mode dev --input description --sim_ts .345 --units idf_tokens \
    --embeddings glove --measures top_n_cos_sim_avg --top_n 6 --print_classifications yes
$ (topn) python perform-c-p-matching.py --mode dev --input both --sim_ts .310 --units idf_tokens \
    --embeddings fasttext --measures top_n_cos_sim_avg --top_n 14 --print_classifications yes
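For convenience, the six reproduction calls above can be collected in a small driver script. The sketch below only assembles and prints the command lines, using exactly the flags and values listed in this README. The RUNS table and the build_command helper are hypothetical and not part of the repository.

```python
SCRIPT = "perform-c-p-matching.py"

# One entry per reproduction call listed in this README.
RUNS = [
    {"input": "label", "sim_ts": ".430", "units": "idf_tokens",
     "embeddings": "google", "measures": "avg_cos_sim"},
    {"input": "description", "sim_ts": ".530", "units": "idf_tokens",
     "embeddings": "google", "measures": "avg_cos_sim"},
    {"input": "both", "sim_ts": ".545", "units": "idf_tokens",
     "embeddings": "google", "measures": "avg_cos_sim"},
    {"input": "label", "sim_ts": ".345", "units": "tokens",
     "embeddings": "google", "measures": "top_n_cos_sim_avg", "top_n": "22"},
    {"input": "description", "sim_ts": ".345", "units": "idf_tokens",
     "embeddings": "glove", "measures": "top_n_cos_sim_avg", "top_n": "6"},
    {"input": "both", "sim_ts": ".310", "units": "idf_tokens",
     "embeddings": "fasttext", "measures": "top_n_cos_sim_avg", "top_n": "14"},
]


def build_command(run):
    """Return the argument list for one run (hypothetical helper)."""
    args = ["python", SCRIPT, "--mode", "dev"]
    for flag, value in run.items():
        args += ["--" + flag, value]
    args += ["--print_classifications", "yes"]
    return args


for run in RUNS:
    # None of the arguments contain spaces, so a plain join is safe here.
    print(" ".join(build_command(run)))
```

To execute the runs instead of printing them, each argument list could be passed to subprocess.run from within the activated topn environment.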