metadata
language:
  - bn
  - gu
  - hi
  - mr
  - ne
  - or
  - pa
  - sa
  - ur
library_name: transformers
pipeline_tag: fill-mask

IA-Original

IA-Original is a multilingual RoBERTa model pre-trained exclusively on 11 Indian languages from the Indo-Aryan language family. It is pre-trained on the monolingual corpora of these languages and subsequently evaluated on a set of diverse tasks.

The 11 languages covered by IA-Original are: Bhojpuri, Bengali, Gujarati, Hindi, Magahi, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Urdu.

The code can be found here. For more information, check out our paper.

Pretraining Corpus

We pre-trained IA-Original on publicly available monolingual corpora. The distribution of languages in the combined corpus is as follows:

| Language | # Sentences | # Tokens (Total) | # Tokens (Unique) |
|---|---|---|---|
| Hindi (hi) | 1552.89 | 20,098.73 | 25.01 |
| Bengali (bn) | 353.44 | 4,021.30 | 6.5 |
| Sanskrit (sa) | 165.35 | 1,381.04 | 11.13 |
| Urdu (ur) | 153.27 | 2,465.48 | 4.61 |
| Marathi (mr) | 132.93 | 1,752.43 | 4.92 |
| Gujarati (gu) | 131.22 | 1,565.08 | 4.73 |
| Nepali (ne) | 84.21 | 1,139.54 | 3.43 |
| Punjabi (pa) | 68.02 | 945.68 | 2.00 |
| Oriya (or) | 17.88 | 274.99 | 1.10 |
| Bhojpuri (bh) | 10.25 | 134.37 | 1.13 |
| Magahi (mag) | 0.36 | 3.47 | 0.15 |
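
To make the imbalance in the corpus easier to see, the sentence counts above can be turned into per-language shares. The following is a small illustrative sketch, not part of the released code: the variable names are our own, and the counts are copied from the table in whatever unit it reports.

```python
# Sentence counts copied from the table above (same unit as reported there).
sentence_counts = {
    "Hindi": 1552.89,
    "Bengali": 353.44,
    "Sanskrit": 165.35,
    "Urdu": 153.27,
    "Marathi": 132.93,
    "Gujarati": 131.22,
    "Nepali": 84.21,
    "Punjabi": 68.02,
    "Oriya": 17.88,
    "Bhojpuri": 10.25,
    "Magahi": 0.36,
}

# Total number of sentences across all 11 languages.
total = sum(sentence_counts.values())

# Fraction of the corpus contributed by each language.
shares = {lang: count / total for lang, count in sentence_counts.items()}

# Print the languages from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10} {share:6.1%}")
```

Hindi alone accounts for more than half of the sentences, while Bhojpuri and Magahi together contribute well under one percent.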

Evaluation Results

IA-Original is evaluated on IndicGLUE and some additional tasks. For more details about the tasks, refer to the paper.

Downloads

The model can be downloaded from Hugging Face.
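
Since the model card declares `library_name: transformers` and `pipeline_tag: fill-mask`, one plausible way to try the model is through the `transformers` fill-mask pipeline. The sketch below is a minimal example under assumptions: the repository ID is not stated in this README, so `model_id` must be supplied by the user, and the helper names `fill_mask` and `format_predictions` are hypothetical.

```python
def fill_mask(model_id, text_template, top_k=5):
    """Run masked-token prediction with the transformers fill-mask pipeline.

    model_id is the Hugging Face repository ID of the model; it is not given
    in this README, so the caller must supply it (assumption).
    text_template must contain the placeholder "{mask}", which is replaced by
    the tokenizer's own mask token.
    """
    # Imported lazily so that the helper below can be used without transformers.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    text = text_template.format(mask=unmasker.tokenizer.mask_token)
    return unmasker(text, top_k=top_k)


def format_predictions(predictions):
    """Reduce pipeline output to (token, score) pairs for display."""
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in predictions]


# Example call (requires network access and the actual repository ID):
# predictions = fill_mask("<repository-id>", "भारत की राजधानी {mask} है।")
# print(format_predictions(predictions))
```

Reading the mask token from the tokenizer avoids hard-coding a value that may differ between checkpoints.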

Citing

If you are using any of the resources, please cite the following article:

@inproceedings{dhamecha-etal-2021-role,
    title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
    author = "Dhamecha, Tejas  and
      Murthy, Rudra  and
      Bharadwaj, Samarth  and
      Sankaranarayanan, Karthik  and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.675",
    doi = "10.18653/v1/2021.emnlp-main.675",
    pages = "8584--8595",
}

Contributors

  • Tejas Dhamecha
  • Rudra Murthy
  • Samarth Bharadwaj
  • Karthik Sankaranarayanan
  • Pushpak Bhattacharyya

Contact