---
license: apache-2.0
language:
- multilingual
datasets:
- cis-lmu/Glot500
metrics:
- accuracy
- f1
- perplexity
library_name: transformers
pipeline_tag: fill-mask
tags:
- glot500
- glot
- multilingual
---
# Glot500 (base-sized model)
The Glot500 model (Glot500-m) is pre-trained on more than 500 languages using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/pdf/2305.12182.pdf) (ACL 2023) and first released in [this repository](https://github.com/cisnlp/Glot500).
## Usage
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cis-lmu/glot500-base')
>>> unmasker("Hello I'm a <mask> model.")
```
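A fill-mask pipeline returns a list of candidate dictionaries with the keys `score`, `token`, `token_str`, and `sequence` when the input contains a single mask. The sketch below shows one way to condense that list into the top candidates. The helper name `summarize_predictions` is hypothetical, and the `example` values are made-up placeholders, not real output of this model.

```python
def summarize_predictions(predictions, top_k=3):
    """Return (token_str, score) pairs for the highest-scoring candidates.

    `predictions` is assumed to have the shape a fill-mask pipeline returns
    for a single-mask input: a list of dicts with 'score' and 'token_str'.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:top_k]]


# Placeholder data with the same keys as pipeline output (NOT real model output).
example = [
    {"score": 0.25, "token": 101, "token_str": "tok_a", "sequence": "Hello I'm a tok_a model."},
    {"score": 0.5, "token": 102, "token_str": "tok_b", "sequence": "Hello I'm a tok_b model."},
    {"score": 0.125, "token": 103, "token_str": "tok_c", "sequence": "Hello I'm a tok_c model."},
]

top_two = summarize_predictions(example, top_k=2)
print(top_two)
```

With the real pipeline, the same helper could be called as `summarize_predictions(unmasker("Hello I'm a <mask> model.", top_k=5))`, where `top_k` sets how many candidates the pipeline returns.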
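Here is how to run a forward pass on a given text in PyTorch. Because the model is loaded with `AutoModelForMaskedLM`, `output.logits` holds the masked language modeling scores for each token position: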
```python
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained('cis-lmu/glot500-base')
>>> model = AutoModelForMaskedLM.from_pretrained("cis-lmu/glot500-base")
>>> # prepare input
>>> text = "Replace me by any text you'd like."
>>> encoded_input = tokenizer(text, return_tensors='pt')
>>> # forward pass
>>> output = model(**encoded_input)
```
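To turn the forward pass above into one fixed-size feature vector per text, one option is to request the hidden states and average the last layer over the non-padding tokens. This is a minimal sketch, not a method prescribed by the Glot500 authors: mean pooling is an assumption chosen for illustration, and the helper names `mean_pool` and `embed` are hypothetical.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against division by zero
    return summed / counts


def embed(texts, model_name="cis-lmu/glot500-base"):
    """Return one mean-pooled vector per input text (downloads the model)."""
    # Imported here so that mean_pool can be used without transformers installed.
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded, output_hidden_states=True)
    # hidden_states[-1] is the output of the final encoder layer
    return mean_pool(output.hidden_states[-1], encoded["attention_mask"])
```

Calling `embed(["Replace me by any text you'd like."])` would return a tensor with one row per input text. Other pooling choices, such as taking the first token's vector, are equally possible.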
### BibTeX entry and citation info
```bibtex
@article{imanigooghari-etal-2023-glot500,
title={Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages},
author={ImaniGooghari, Ayyoob and Lin, Peiqin and Kargaran, Amir Hossein and Severini, Silvia and Jalili Sabet, Masoud and Kassner, Nora and Ma, Chunlan and Schmid, Helmut and Martins, Andr{\'e} and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
journal={arXiv preprint arXiv:2305.12182},
year={2023}
}
```
<!---
```bibtex
@inproceedings{imanigooghari-etal-2023-glot500,
title = {Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages},
author = {ImaniGooghari, Ayyoob and Lin, Peiqin and Kargaran, Amir Hossein and Severini, Silvia and Jalili Sabet, Masoud and Kassner, Nora and Ma, Chunlan and Schmid, Helmut and Martins, Andr{\'e} and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
year = 2023,
month = jul,
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
publisher = {Association for Computational Linguistics},
address = {Toronto, Canada},
pages = {1082--1117},
url = {https://aclanthology.org/2023.acl-long.61}
}
```
-->