---
license: apache-2.0
language:
- multilingual
datasets:
- cis-lmu/Glot500
metrics:
- accuracy
- f1
- perplexity
library_name: transformers
pipeline_tag: fill-mask
tags:
- glot500
- glot
- multilingual
---

# Glot500 (base-sized model) 

Glot500-m is a multilingual model pre-trained on more than 500 languages with a masked language modeling (MLM) objective. It was introduced in
the [Glot500 paper](https://arxiv.org/pdf/2305.12182.pdf) (ACL 2023) and first released in the [cisnlp/Glot500 repository](https://github.com/cisnlp/Glot500).


## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cis-lmu/glot500-base')
>>> unmasker("Hello I'm a <mask> model.")
```
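
The pipeline call above returns a list of candidate fillings for the masked position. As a minimal sketch, assuming each candidate is a dict carrying `score` and `token_str` keys (the usual shape of `fill-mask` output for a single mask), the hypothetical helper `top_predictions` below ranks the candidates and keeps the best few. Both `top_predictions` and `demo` are illustrative names, not part of the model or of `transformers`.

```python
from typing import Dict, List, Tuple


def top_predictions(predictions: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token, score) pairs from fill-mask output.

    Assumes each prediction dict has "score" and "token_str" keys.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    # Strip the whitespace that some tokenizers attach to the token string.
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked[:k]]


def demo() -> None:
    """Run the model and print its top candidates (downloads the checkpoint)."""
    # Imported here so the helper above works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="cis-lmu/glot500-base")
    # Read the mask token from the tokenizer instead of hard-coding "<mask>".
    mask = unmasker.tokenizer.mask_token
    for token, score in top_predictions(unmasker(f"Hello I'm a {mask} model.")):
        print(f"{token}\t{score}")
```

Calling `demo()` prints one tab-separated candidate per line. Reading the mask token from `unmasker.tokenizer.mask_token` keeps the prompt valid if the checkpoint is swapped for one with a different mask symbol.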


Here is how to use this model to get the features of a given text in PyTorch:

```python
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained('cis-lmu/glot500-base')
>>> model = AutoModelForMaskedLM.from_pretrained("cis-lmu/glot500-base")

>>> # prepare input
>>> text = "Replace me by any text you'd like."
>>> encoded_input = tokenizer(text, return_tensors='pt')

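>>> # forward pass; request hidden states, because the MLM head alone returns vocabulary logits
>>> output = model(**encoded_input, output_hidden_states=True)
>>> features = output.hidden_states[-1]  # last-layer features, one vector per token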
```

## BibTeX entry and citation info

```bibtex
@article{imanigooghari-etal-2023-glot500,
  title={Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages},
  author={ImaniGooghari, Ayyoob  and Lin, Peiqin  and Kargaran, Amir Hossein  and Severini, Silvia  and Jalili Sabet, Masoud  and Kassner, Nora  and Ma, Chunlan  and Schmid, Helmut  and Martins, Andr{\'e}  and Yvon, Fran{\c{c}}ois  and Sch{\"u}tze, Hinrich},
  journal={arXiv preprint arXiv:2305.12182},
  year={2023}
}
```

<!---

```bibtex
@inproceedings{imanigooghari-etal-2023-glot500,
	title        = {Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages},
	author       = {ImaniGooghari, Ayyoob  and Lin, Peiqin  and Kargaran, Amir Hossein  and Severini, Silvia  and Jalili Sabet, Masoud  and Kassner, Nora  and Ma, Chunlan  and Schmid, Helmut  and Martins, Andr{\'e}  and Yvon, Fran{\c{c}}ois  and Sch{\"u}tze, Hinrich},
	year         = 2023,
	month        = jul,
	booktitle    = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
	publisher    = {Association for Computational Linguistics},
	address      = {Toronto, Canada},
	pages        = {1082--1117},
	url          = {https://aclanthology.org/2023.acl-long.61}
}
```
-->