MBay: A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery

The original implementation of BaySMM is now moved to legacy
The multilingual version also supports monolingual training.

About

The paper is available on arXiv
See INSTALL.md for requirements and installation instructions.
steps/ - Contains recipes for downloading parallel data from OPUS to train MBay models and downstream classifiers
maby/ - Source code for the model definition, training utils, etc.
scripts/ - Source code for data preparation and classifiers
lists/ - Contains list of languages and dataset names used for trianing - these files are need for recipes in the steps/
etc/ - json files topic ID to int mapping
pylibs/ - add to PYTHONPATH
data_splits/ - Contains file IDs / doc IDs / rowIDs for creating 5 splits for MLDoc and INA
To create MLDoc5x Use the give file IDs with the code from original MLDoc
To create INA5x Dowload the original from IndicNLPsuite and run src/generate_documents_indic_news_articles.py
See steps/ 0 through 6 for downloading data, preparing, extracting bow features, training mBay and cross-lingual classification

Recipe

bash env.sh

Download Europarl, MultiUN, NewsCommentary, GlobalVoices, for languages en, de, es, fr, it, ru.

steps/0_download_data_and_extract.sh lists/dataset_7L.list lists/langs_6L.list data/

steps/1_apply_msl_constraint_and_extract.sh lists/dataset_7L.list lists/langs_6L.list data/ 35

steps/2_create_flists_per_language.py data/ lists/dataset_7L.list -lang_list_file lists/langs_6L.list -msl 35 -out_flist_dir flists/eu-multiun-nc-gv/

Create bag-of-words statistics using data for each language, independently. Here, we are limiting the max vocab size to 10000, but it can be even higher.

python steps/3_build_vocabulary_and_extract_bow_stats.py flists/eu-multiun-nc-gv/ exp/bow/eu-multiun-nc-gv/ -lang_list_file lists/langs_6L.list -xtr_flist_dir flists/eu-multiun-nc-gv/ -xtr_tag eu-multiun-nc-gv -mv 10000

steps/4_create_train_json.sh lists/dataset_7L.list lists/langs_6L.list exp/bow/eu-multiun-nc-gv/

steps/5_train_model_and_xtr_embeddings.sh exp/bow/eu-multiun-nc-gv/ 6L

Citation

@misc{kesiraju2020bayesian,
      title={A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery},
      author={Santosh Kesiraju and Sangeet Sagar and Ondřej Glembek and Lukáš Burget and Ján Černocký and Suryakanth V Gangashetty},
      year={2020},
      eprint={2007.01359v3},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

MBay: A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery

About

Recipe

Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

MBay: A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery

About

Recipe

Citation