- The original implementation of BaySMM is now moved to legacy
- The multilingual version also supports monolingual training.
-
The paper is available on arXiv
-
See INSTALL.md for requirements and installation instructions.
-
steps/
- Contains recipes for downloading parallel data from OPUS to train MBay models and downstream classifiers -
maby/
- Source code for the model definition, training utils, etc. -
scripts/
- Source code for data preparation and classifiers -
lists/
- Contains list of languages and dataset names used for trianing - these files are need for recipes in the steps/ -
etc/
- json files topic ID to int mapping -
pylibs/
- add to PYTHONPATH -
data_splits/
- Contains file IDs / doc IDs / rowIDs for creating 5 splits for MLDoc and INA -
To create MLDoc5x Use the give file IDs with the code from original MLDoc
-
To create INA5x Dowload the original from IndicNLPsuite and run
src/generate_documents_indic_news_articles.py
-
See
steps/
0 through 6 for downloading data, preparing, extracting bow features, training mBay and cross-lingual classification
bash env.sh
- Download Europarl, MultiUN, NewsCommentary, GlobalVoices, for languages en, de, es, fr, it, ru.
steps/0_download_data_and_extract.sh lists/dataset_7L.list lists/langs_6L.list data/
steps/1_apply_msl_constraint_and_extract.sh lists/dataset_7L.list lists/langs_6L.list data/ 35
steps/2_create_flists_per_language.py data/ lists/dataset_7L.list -lang_list_file lists/langs_6L.list -msl 35 -out_flist_dir flists/eu-multiun-nc-gv/
- Create bag-of-words statistics using data for each language, independently. Here, we are limiting the max vocab size to
10000
, but it can be even higher.
python steps/3_build_vocabulary_and_extract_bow_stats.py flists/eu-multiun-nc-gv/ exp/bow/eu-multiun-nc-gv/ -lang_list_file lists/langs_6L.list -xtr_flist_dir flists/eu-multiun-nc-gv/ -xtr_tag eu-multiun-nc-gv -mv 10000
steps/4_create_train_json.sh lists/dataset_7L.list lists/langs_6L.list exp/bow/eu-multiun-nc-gv/
steps/5_train_model_and_xtr_embeddings.sh exp/bow/eu-multiun-nc-gv/ 6L
@misc{kesiraju2020bayesian,
title={A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery},
author={Santosh Kesiraju and Sangeet Sagar and Ondřej Glembek and Lukáš Burget and Ján Černocký and Suryakanth V Gangashetty},
year={2020},
eprint={2007.01359v3},
archivePrefix={arXiv},
primaryClass={cs.CL}
}