Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa)

https://arxiv.org/pdf/1911.02116.pdf

Larger-Scale Transformers for Multilingual Masked Language Modeling

https://arxiv.org/pdf/2105.00572.pdf

What's New:

  • June 2021: XLMR-XL and XLMR-XXL models released.

Introduction

XLM-R (XLM-RoBERTa) is a generic cross-lingual sentence encoder that obtains state-of-the-art results on many cross-lingual understanding (XLU) benchmarks. It is trained on 2.5 TB of filtered CommonCrawl data in 100 languages (listed below).

| Language | Language | Language | Language | Language |
|---|---|---|---|---|
| Afrikaans | Albanian | Amharic | Arabic | Armenian |
| Assamese | Azerbaijani | Basque | Belarusian | Bengali |
| Bengali Romanize | Bosnian | Breton | Bulgarian | Burmese |
| Burmese zawgyi font | Catalan | Chinese (Simplified) | Chinese (Traditional) | Croatian |
| Czech | Danish | Dutch | English | Esperanto |
| Estonian | Filipino | Finnish | French | Galician |
| Georgian | German | Greek | Gujarati | Hausa |
| Hebrew | Hindi | Hindi Romanize | Hungarian | Icelandic |
| Indonesian | Irish | Italian | Japanese | Javanese |
| Kannada | Kazakh | Khmer | Korean | Kurdish (Kurmanji) |
| Kyrgyz | Lao | Latin | Latvian | Lithuanian |
| Macedonian | Malagasy | Malay | Malayalam | Marathi |
| Mongolian | Nepali | Norwegian | Oriya | Oromo |
| Pashto | Persian | Polish | Portuguese | Punjabi |
| Romanian | Russian | Sanskrit | Scottish Gaelic | Serbian |
| Sindhi | Sinhala | Slovak | Slovenian | Somali |
| Spanish | Sundanese | Swahili | Swedish | Tamil |
| Tamil Romanize | Telugu | Telugu Romanize | Thai | Turkish |
| Ukrainian | Urdu | Urdu Romanize | Uyghur | Uzbek |
| Vietnamese | Welsh | Western Frisian | Xhosa | Yiddish |

Pre-trained models

| Model | Description | #params | vocab size | Download |
|---|---|---|---|---|
| xlmr.base | XLM-R using the BERT-base architecture | 250M | 250k | xlmr.base.tar.gz |
| xlmr.large | XLM-R using the BERT-large architecture | 560M | 250k | xlmr.large.tar.gz |
| xlmr.xl | XLM-R (layers=36, model_dim=2560) | 3.5B | 250k | xlmr.xl.tar.gz |
| xlmr.xxl | XLM-R (layers=48, model_dim=4096) | 10.7B | 250k | xlmr.xxl.tar.gz |
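
The parameter counts above can be roughly sanity-checked from the listed shapes. The sketch below is a back-of-the-envelope estimate, not the fairseq implementation. It assumes a standard Transformer encoder with a feed-forward width of 4 × model_dim and a vocabulary of exactly 250,000 entries, and it ignores biases, layer norms, and positional embeddings. The shapes used for base and large (12 × 768 and 24 × 1024) are the usual BERT-base and BERT-large settings, and `estimate_params` is a hypothetical helper.

```python
def estimate_params(layers: int, model_dim: int, vocab_size: int = 250_000) -> int:
    """Rough parameter count for a Transformer encoder (see assumptions above)."""
    embedding = vocab_size * model_dim              # token embedding matrix
    attention = 4 * model_dim * model_dim           # Q, K, V and output projections
    feed_forward = 2 * model_dim * (4 * model_dim)  # two linear maps, assumed 4x width
    return embedding + layers * (attention + feed_forward)


# (layers, model_dim) per model; base and large use assumed BERT shapes
SHAPES = {
    "xlmr.base": (12, 768),
    "xlmr.large": (24, 1024),
    "xlmr.xl": (36, 2560),
    "xlmr.xxl": (48, 4096),
}

ESTIMATES = {name: estimate_params(*shape) for name, shape in SHAPES.items()}

if __name__ == "__main__":
    for name, count in ESTIMATES.items():
        print(f"{name}: ~{count / 1e9:.2f}B parameters")
```

Under these assumptions the estimates for large, XL, and XXL come out at about 0.56B, 3.47B, and 10.69B, close to the listed sizes. The base estimate (about 0.28B) is somewhat above the listed 250M, so treat all of these as approximations.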

Results

XNLI (Conneau et al., 2018)

| Model | average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| roberta.large.mnli (TRANSLATE-TEST) | 77.8 | 91.3 | 82.9 | 84.3 | 81.2 | 81.7 | 83.1 | 78.3 | 76.8 | 76.6 | 74.2 | 74.1 | 77.5 | 70.9 | 66.7 | 66.8 |
| xlmr.large (TRANSLATE-TRAIN-ALL) | 83.6 | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | 83.7 | 81.6 | 78.0 | 78.1 |
| xlmr.xl (TRANSLATE-TRAIN-ALL) | 85.4 | 91.1 | 87.2 | 88.1 | 87.0 | 87.4 | 87.8 | 85.3 | 85.2 | 85.3 | 86.2 | 83.8 | 85.3 | 83.1 | 79.8 | 78.2 |
| xlmr.xxl (TRANSLATE-TRAIN-ALL) | 86.0 | 91.5 | 87.6 | 88.7 | 87.8 | 87.4 | 88.2 | 85.6 | 85.1 | 85.8 | 86.3 | 83.9 | 85.6 | 84.6 | 81.7 | 80.6 |
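
The average column appears to be the unweighted mean of the 15 per-language accuracies. A quick check, with scores copied from the xlmr.xl and xlmr.xxl rows above; `macro_average` is a hypothetical helper, not part of fairseq:

```python
def macro_average(scores):
    """Unweighted mean across languages, rounded to one decimal like the table."""
    return round(sum(scores) / len(scores), 1)


# Per-language XNLI accuracies in table order: en fr es de el bg ru tr ar vi th zh hi sw ur
XNLI_XL = [91.1, 87.2, 88.1, 87.0, 87.4, 87.8, 85.3, 85.2,
           85.3, 86.2, 83.8, 85.3, 83.1, 79.8, 78.2]
XNLI_XXL = [91.5, 87.6, 88.7, 87.8, 87.4, 88.2, 85.6, 85.1,
            85.8, 86.3, 83.9, 85.6, 84.6, 81.7, 80.6]

if __name__ == "__main__":
    print("xlmr.xl:", macro_average(XNLI_XL))
    print("xlmr.xxl:", macro_average(XNLI_XXL))
```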

MLQA (Lewis et al., 2019), scores reported as F1 / EM

| Model | average | en | es | de | ar | hi | vi | zh |
|---|---|---|---|---|---|---|---|---|
| BERT-large | - | 80.2 / 67.4 | - | - | - | - | - | - |
| mBERT | 57.7 / 41.6 | 77.7 / 65.2 | 64.3 / 46.6 | 57.9 / 44.3 | 45.7 / 29.8 | 43.8 / 29.7 | 57.1 / 38.6 | 57.5 / 37.3 |
| xlmr.large | 70.7 / 52.7 | 80.6 / 67.8 | 74.1 / 56.0 | 68.5 / 53.6 | 63.1 / 43.5 | 69.2 / 51.6 | 71.3 / 50.9 | 68.0 / 45.4 |
| xlmr.xl | 73.4 / 55.3 | 85.1 / 72.6 | 66.7 / 46.2 | 70.5 / 55.5 | 74.3 / 56.9 | 72.2 / 54.7 | 74.4 / 52.9 | 70.9 / 48.5 |
| xlmr.xxl | 74.8 / 56.6 | 85.5 / 72.4 | 68.6 / 48.4 | 72.7 / 57.8 | 75.4 / 57.6 | 73.7 / 55.8 | 76.0 / 55.0 | 71.7 / 48.9 |

Example usage

Load XLM-R from torch.hub (PyTorch >= 1.1):

```python
import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```

Load XLM-R (for PyTorch 1.0 or custom models):

```bash
# Download xlmr.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz
tar -xzvf xlmr.large.tar.gz
```

```python
# Load the model in fairseq
from fairseq.models.roberta import XLMRModel
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```

Apply sentence-piece-model (SPM) encoding to input text:

```python
en_tokens = xlmr.encode('Hello world!')
assert en_tokens.tolist() == [0, 35378, 8999, 38, 2]
xlmr.decode(en_tokens)  # 'Hello world!'

zh_tokens = xlmr.encode('你好,世界')
assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
xlmr.decode(zh_tokens)  # '你好,世界'

hi_tokens = xlmr.encode('नमस्ते दुनिया')
assert hi_tokens.tolist() == [0, 68700, 97883, 29405, 2]
xlmr.decode(hi_tokens)  # 'नमस्ते दुनिया'

ar_tokens = xlmr.encode('مرحبا بالعالم')
assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
xlmr.decode(ar_tokens)  # 'مرحبا بالعالم'

fr_tokens = xlmr.encode('Bonjour le monde')
assert fr_tokens.tolist() == [0, 84602, 95, 11146, 2]
xlmr.decode(fr_tokens)  # 'Bonjour le monde'
```

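
Each encoded sequence above begins with 0 and ends with 2. In fairseq dictionaries these ids are the beginning-of-sentence `<s>` and end-of-sentence `</s>` symbols, which `encode` wraps around the SentencePiece ids. A small sketch that needs no model download and strips them from the id lists shown above; `strip_special` and the constant names are hypothetical, not fairseq API:

```python
BOS_ID = 0  # <s>, beginning of sentence
EOS_ID = 2  # </s>, end of sentence


def strip_special(token_ids):
    """Drop a leading <s> and a trailing </s> from a list of token ids."""
    ids = list(token_ids)
    if ids and ids[0] == BOS_ID:
        ids = ids[1:]
    if ids and ids[-1] == EOS_ID:
        ids = ids[:-1]
    return ids


# Id lists copied from the assertions above
EN_IDS = [0, 35378, 8999, 38, 2]
ZH_IDS = [0, 6, 124084, 4, 3221, 2]

if __name__ == "__main__":
    print(strip_special(EN_IDS))
    print(strip_special(ZH_IDS))
```
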
Extract features from XLM-R:

```python
# Extract the last layer's features
last_layer_features = xlmr.extract_features(zh_tokens)
assert last_layer_features.size() == torch.Size([1, 6, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```
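
The numbers in the assertions above follow from the model shape: xlmr.large uses the BERT-large architecture, the input is a single sentence of 6 token ids, and `return_all_hiddens=True` yields the embedding output plus one tensor per Transformer layer. A small bookkeeping sketch that does not load the model; `expected_shapes` is a hypothetical helper, and the layer count and hidden size are the standard BERT-large settings rather than values read from the checkpoint:

```python
def expected_shapes(num_tokens, hidden_dim, num_layers, batch_size=1):
    """Return (feature tensor shape, number of hidden states incl. embeddings)."""
    feature_shape = (batch_size, num_tokens, hidden_dim)
    num_hidden_states = num_layers + 1  # layer 0 is the embedding layer
    return feature_shape, num_hidden_states


# Assumed BERT-large settings for xlmr.large
LARGE_LAYERS = 24
LARGE_HIDDEN_DIM = 1024

ZH_IDS = [0, 6, 124084, 4, 3221, 2]  # ids of the Chinese example above

SHAPE, NUM_STATES = expected_shapes(len(ZH_IDS), LARGE_HIDDEN_DIM, LARGE_LAYERS)

if __name__ == "__main__":
    print(SHAPE, NUM_STATES)
```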

Citation

```bibtex
@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}
```

```bibtex
@article{goyal2021larger,
  title={Larger-Scale Transformers for Multilingual Masked Language Modeling},
  author={Goyal, Naman and Du, Jingfei and Ott, Myle and Anantharaman, Giri and Conneau, Alexis},
  journal={arXiv preprint arXiv:2105.00572},
  year={2021}
}
```