
RoBERTa Hindi

A HuggingFace Flax-Community project to pretrain RoBERTa for the Hindi language and compare its performance on various downstream tasks with existing models. Forum thread: https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi

Contents

  • Datasets
  • IndicGLUE Fix
  • Benchmarks
  • Demo
  • Training data
  • Evaluation Results
  • Team Members
  • Future Work

Datasets

In order to create a stable pipeline for building a large corpus of Hindi-text datasets, we brought the different datasets available on HuggingFace's dataset hub and Kaggle under a single wrapper using the datasets library. Dataset loading scripts for all the Hindi datasets we could find are available in the datasets/ directory.

concatenate_datasets_and_clean.py merges all of these datasets (with cleaning hooks) into a single large dataset that can be fed seamlessly into the Hugging Face transformers MLM training loop.
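
For illustration, a minimal sketch of the kind of merge the script performs, assuming two constituent corpora from the hub; the cleaning hook shown here is a simplified stand-in for the script's actual hooks:

```python
# Simplified sketch of merging Hindi corpora with a cleaning hook; the real
# concatenate_datasets_and_clean.py covers many more datasets and hooks.
from datasets import load_dataset, concatenate_datasets

def clean(example):
    # Stand-in cleaning hook: collapse runs of whitespace.
    example["text"] = " ".join(example["text"].split())
    return example

# Two constituent corpora (small slices, purely for illustration).
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train[:1000]")
mc4_hi = load_dataset("mc4", "hi", split="train[:1000]")

# Keep only the text column so the schemas line up, clean, then concatenate.
parts = []
for ds in (oscar_hi, mc4_hi):
    ds = ds.remove_columns([c for c in ds.column_names if c != "text"])
    parts.append(ds.map(clean))

merged = concatenate_datasets(parts).shuffle(seed=42)
print(merged)
```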

IndicGLUE Fix

A minor fix: the WikiNER Hindi downstream task in the HuggingFace datasets library was buggy, so load indic_glue_dataset_fixed/indic_glue.py instead to perform downstream evaluation.
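
For example, downstream evaluation can point at the patched script instead of the hub copy (the wiki-ner.hi config name below follows the upstream indic_glue naming and is assumed here):

```python
# Load WikiNER (Hindi) through the patched IndicGLUE loading script.
from datasets import load_dataset

wikiner_hi = load_dataset("indic_glue_dataset_fixed/indic_glue.py", "wiki-ner.hi")
print(wikiner_hi)
```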

Benchmarks

The benchmarks/ directory contains helpers built around the IndicGLUE benchmarks that can be used to fine-tune and run downstream evaluation on the following tasks (for multiple reference models available on the model hub), logging results to wandb:

  • BBCA news classification
  • IITP product reviews
  • IITP movie reviews
  • WikiNER (Named entity recognition)

Note: We are still in the process of cleaning up some Colab notebooks and will commit an end-to-end working script soon.
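
In the meantime, here is a minimal sketch of what such a fine-tuning run could look like with the transformers Trainer; the hyperparameters, split names, and label handling are illustrative assumptions rather than the settings used for the reported numbers:

```python
# Illustrative fine-tuning sketch for BBC news classification (bbca.hi).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("indic_glue", "bbca.hi")
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Map category names to integer ids (assumes a "label" column).
labels = sorted(dataset["train"].unique("label"))
dataset = dataset.map(lambda x: {"label": labels.index(x["label"])})

model = AutoModelForSequenceClassification.from_pretrained(
    "flax-community/roberta-hindi", num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bbca-hi", num_train_epochs=3,
                           per_device_train_batch_size=16, report_to="wandb"),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())
```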

Demo

Check out the Demo here!


Example code to use our model for the mask-filling task:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
>>> unmasker("हम आपके सुखद <mask> की कामना करते हैं")
[{'score': 0.3310680091381073,
  'sequence': 'हम आपके सुखद सफर की कामना करते हैं',
  'token': 1349,
  'token_str': ' सफर'},
 {'score': 0.15317578613758087,
  'sequence': 'हम आपके सुखद पल की कामना करते हैं',
  'token': 848,
  'token_str': ' पल'},
 {'score': 0.07826550304889679,
  'sequence': 'हम आपके सुखद समय की कामना करते हैं',
  'token': 453,
  'token_str': ' समय'},
 {'score': 0.06304813921451569,
  'sequence': 'हम आपके सुखद पहल की कामना करते हैं',
  'token': 404,
  'token_str': ' पहल'},
 {'score': 0.058322224766016006,
  'sequence': 'हम आपके सुखद अवसर की कामना करते हैं',
  'token': 857,
  'token_str': ' अवसर'}]
"""

Training data

The RoBERTa Hindi model was pretrained on the combination of the following datasets:

  • OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • mC4 is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus.
  • IndicGLUE is a natural language understanding benchmark.
  • Samanantar is a parallel corpora collection for Indic languages.
  • Hindi Text Short and Large Summarization Corpus is a collection of ~180k articles with their headlines and summaries, collected from Hindi news websites.
  • Hindi Text Short Summarization Corpus is a collection of ~330k articles with their headlines, collected from Hindi news websites.
  • Old Newspapers Hindi is a cleaned subset of HC Corpora newspapers.
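
For reference, a generic sketch of turning such a merged corpus into masked-language-modeling batches with transformers; this is not the project's actual Flax pretraining script, and the sequence length and masking probability are illustrative assumptions:

```python
# Generic MLM data-preparation sketch (PyTorch collator shown for brevity).
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")

# Stand-in for the merged corpus: a small slice of Hindi OSCAR.
corpus = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train[:1000]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

# RoBERTa-style dynamic masking: 15% of tokens are masked on the fly per batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator([tokenized[i] for i in range(8)])
print(batch["input_ids"].shape, batch["labels"].shape)
```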

Evaluation Results

RoBERTa Hindi is evaluated on various downstream tasks. The results are summarized below.

| Task | Task Type | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
| --- | --- | --- | --- | --- | --- | --- |
| BBC News Classification | Genre Classification | 76.44 | 66.86 | 77.6 | 64.9 | 73.67 |
| WikiNER | Token Classification | - | 90.68 | 95.09 | 89.61 | 92.76 |
| IITP Product Reviews | Sentiment Analysis | 78.01 | 73.23 | 78.39 | 66.16 | 75.53 |
| IITP Movie Reviews | Sentiment Analysis | 60.97 | 52.26 | 70.65 | 49.35 | 61.29 |

Team Members

Future Work

Create the best Hindi language model for use in real-world applications.