---
license: apache-2.0
datasets:
- CohereForAI/xP3x
- CohereForAI/aya_dataset
- CohereForAI/aya_collection
- DataProvenanceInitiative/Commercially-Verified-Licenses
- CohereForAI/aya_evaluation_suite
language:
- afr
- amh
- ara
- aze
- bel
- ben
- bul
- cat
- ceb
- ces
- cym
- dan
- deu
- ell
- eng
- epo
- est
- eus
- fin
- fil
- fra
- fry
- gla
- gle
- glg
- guj
- hat
- hau
- heb
- hin
- hun
- hye
- ibo
- ind
- isl
- ita
- jav
- jpn
- kan
- kat
- kaz
- khm
- kir
- kor
- kur
- lao
- lav
- lat
- lit
- ltz
- mal
- mar
- mkd
- mlg
- mlt
- mon
- mri
- msa
- mya
- nep
- nld
- nor
- nso
- nya
- ory
- pan
- pes
- pol
- por
- pus
- ron
- rus
- sin
- slk
- slv
- smo
- sna
- snd
- som
- sot
- spa
- sqi
- srp
- sun
- swa
- swe
- tam
- tel
- tgk
- tha
- tur
- twi
- ukr
- urd
- uzb
- vie
- xho
- yid
- yor
- zho
- zul
metrics:
- accuracy
- bleu
---
<img src="aya-fig1.png" alt="Aya model summary image" width="800" style="margin-left:auto; margin-right:auto; display:block"/>
# Model Card for Aya Model
## Model Summary
> The Aya model is a massively multilingual generative language model that follows instructions in 101 languages.
> Aya outperforms [mT0](https://huggingface.co./bigscience/mt0-xxl) and [BLOOMZ](https://huggingface.co./bigscience/bloomz) on a wide variety of automatic and human evaluations despite covering double the number of languages.
> The Aya model is trained using [xP3x](https://huggingface.co./datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co./datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co./datasets/CohereForAI/aya_collection), a subset of [DataProvenance collection](https://huggingface.co./datasets/DataProvenanceInitiative/Commercially-Verified-Licenses) and ShareGPT-Command.
> We release the checkpoints under an Apache-2.0 license to further our mission of multilingual technologies empowering a
> multilingual world.
- **Developed by:** Cohere For AI
- **Model type:** A Transformer-style, autoregressive, massively multilingual language model.
- **Paper**: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com)
- **Point of Contact**: [Ahmet Ustun](mailto:[email protected])
- **Languages**: Refer to the list of languages in the `language` section of this model card.
- **License**: Apache-2.0
- **Model**: [Aya](https://huggingface.co./CohereForAI/aya)
- **Model Size**: 13 billion parameters
- **Datasets**: [xP3x](https://huggingface.co./datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co./datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co./datasets/CohereForAI/aya_collection), [DataProvenance collection](https://huggingface.co./datasets/DataProvenanceInitiative/Commercially-Verified-Licenses), ShareGPT-Command.
## Use
```python
# pip install -q transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "CohereForAI/aya_model"

# Load the tokenizer and the sequence-to-sequence model weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
aya_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Encode an instruction, generate a response, and decode it back to text
inputs = tokenizer.encode("Translate to English: Je t’aime.", return_tensors="pt")
outputs = aya_model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
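To run several prompts at once, one option is to pad them into a single batch and strip special tokens when decoding. The sketch below is not from the model authors: `generate_batch` is a hypothetical helper and `max_new_tokens=128` is an assumed default. It accepts any tokenizer and seq2seq model loaded with `transformers`, so nothing is downloaded until you call it.

```python
def generate_batch(prompts, tokenizer, model, max_new_tokens=128):
    """Generate one completion per prompt (hypothetical helper, not part of Aya)."""
    # Pad the prompts to a common length so they form a single batch.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    # Cap the output length; 128 is an assumed value, tune it for your task.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop padding and end-of-sequence tokens from the decoded text.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With a tokenizer and model loaded from the checkpoint, `generate_batch(["Translate to English: Je t’aime."], tokenizer, aya_model)` returns a list with one decoded string per prompt.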
## Model Details
### Training
- Architecture: Same as [mt5-xxl](https://huggingface.co./google/mt5-xxl)
- Number of Finetuning Samples: 25M
- Batch size: 256
- Hardware: TPUv4-128
- Software: T5X, Jax
### Data Sources
The Aya model is trained on the following datasets:
- [xP3x](https://huggingface.co./datasets/CohereForAI/xP3x)
- [Aya Dataset](https://huggingface.co./datasets/CohereForAI/aya_dataset)
- [Aya Collection](https://huggingface.co./datasets/CohereForAI/aya_collection)
- [DataProvenance collection](https://huggingface.co./datasets/DataProvenanceInitiative/Commercially-Verified-Licenses)
- ShareGPT-Command
All datasets are subset to the 101 languages supported by mT5. See the [paper](arxiv.com) for details about filtering and pruning.
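The language subsetting described above can be pictured as an allow-list filter. This is an illustration only, not the authors' pipeline: the record layout (a dict with a `language` field holding an ISO 639-3 code) and the helper name `subset_to_supported` are assumptions, and the allow-list is a small excerpt of the codes in this card's `language` section.

```python
# Excerpt of the ISO 639-3 codes listed in this card (the full list has 101 entries).
SUPPORTED_LANGUAGES = {"afr", "amh", "ara", "eng", "fra", "hin", "zho", "zul"}


def subset_to_supported(records, supported=SUPPORTED_LANGUAGES):
    """Keep only records whose `language` code is in the allow-list.

    `records` is assumed to be an iterable of dicts with a `language` key;
    real datasets may name or encode this field differently.
    """
    return [r for r in records if r.get("language") in supported]


examples = [
    {"language": "fra", "text": "Je t’aime."},
    {"language": "tlh", "text": "(language not in the allow-list)"},
    {"language": "eng", "text": "I love you."},
]
kept = subset_to_supported(examples)
print([r["language"] for r in kept])
```

Records with a missing or unlisted language code are dropped rather than raising an error, which keeps the filter usable on heterogeneous sources.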
## Evaluation
See Section 5 of our paper for multilingual evaluation across 99 languages, including discriminative and generative tasks, human evaluation, and simulated win rates, covering both held-out tasks and in-distribution performance.
## Bias, Risks, and Limitations
For a detailed overview of our efforts at safety mitigation and at benchmarking toxicity and bias across multiple languages, we refer to Sections 6 and 7 of our paper: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com).
We hope that releasing the Aya model will make community-based red-teaming efforts possible by providing an open-source, massively multilingual model for community research.
## Citation
**BibTeX:**
```
@article{,
title={},
author={},
journal={Preprint},
year={2024}
}
```
**APA:**
## Languages Covered
Below is the list of languages used in finetuning the Aya Model. We group languages into higher-, mid-, and lower-resourcedness based on the language classification by [Joshi et al., 2020](https://microsoft.github.io/linguisticdiversity/). For further details, refer to our [paper]().
| ISO Code | Language Name | Script | Family | Subgrouping | Resourcedness |
| :------- | :-------------- | :----------: | :-------------: | :---------------: | :-----------: |
| afr | Afrikaans | Latin | Indo-European | Germanic | Mid |
| amh | Amharic | Ge'ez | Afro-Asiatic | Semitic | Low |
| ara | Arabic | Arabic | Afro-Asiatic | Semitic | High |
| aze | Azerbaijani | Arabic/Latin | Turkic | Common Turkic | Low |
| bel | Belarusian | Cyrillic | Indo-European | Balto-Slavic | Mid |
| ben | Bengali | Bengali | Indo-European | Indo-Aryan | Mid |
| bul | Bulgarian | Cyrillic | Indo-European | Balto-Slavic | Mid |
| cat | Catalan | Latin | Indo-European | Italic | High |
| ceb | Cebuano | Latin | Austronesian | Malayo-Polynesian | Mid |
| ces | Czech | Latin | Indo-European | Balto-Slavic | High |
| cym | Welsh | Latin | Indo-European | Celtic | Low |
| dan | Danish | Latin | Indo-European | Germanic | Mid |
| deu | German | Latin | Indo-European | Germanic | High |
| ell | Greek | Greek | Indo-European | Graeco-Phrygian | Mid |
| eng | English | Latin | Indo-European | Germanic | High |
| epo | Esperanto | Latin | Constructed | Esperantic | Low |
| est | Estonian | Latin | Uralic | Finnic | Mid |
| eus | Basque | Latin | Basque | - | High |
| fin | Finnish | Latin | Uralic | Finnic | High |
| fil | Tagalog | Latin | Austronesian | Malayo-Polynesian | Mid |
| fra | French | Latin | Indo-European | Italic | High |
| fry | Western Frisian | Latin | Indo-European | Germanic | Low |
| gla | Scottish Gaelic | Latin | Indo-European | Celtic | Low |
| gle | Irish | Latin | Indo-European | Celtic | Low |
| glg | Galician | Latin | Indo-European | Italic | Mid |
| guj | Gujarati | Gujarati | Indo-European | Indo-Aryan | Low |
| hat | Haitian Creole | Latin | Indo-European | Italic | Low |
| hau | Hausa | Latin | Afro-Asiatic | Chadic | Low |
| heb | Hebrew | Hebrew | Afro-Asiatic | Semitic | Mid |
| hin | Hindi | Devanagari | Indo-European | Indo-Aryan | High |
| hun | Hungarian | Latin | Uralic | - | High |
| hye | Armenian | Armenian | Indo-European | Armenic | Low |
| ibo | Igbo | Latin | Atlantic-Congo | Benue-Congo | Low |
| ind | Indonesian | Latin | Austronesian | Malayo-Polynesian | Mid |
| isl | Icelandic | Latin | Indo-European | Germanic | Low |
| ita | Italian | Latin | Indo-European | Italic | High |
| jav | Javanese | Latin | Austronesian | Malayo-Polynesian | Low |
| jpn | Japanese | Japanese | Japonic | Japanesic | High |
| kan | Kannada | Kannada | Dravidian | South Dravidian | Low |
| kat | Georgian | Georgian | Kartvelian | Georgian-Zan | Mid |
| kaz | Kazakh | Cyrillic | Turkic | Common Turkic | Mid |
| khm | Khmer | Khmer | Austroasiatic | Khmeric | Low |
| kir | Kyrgyz | Cyrillic | Turkic | Common Turkic | Low |
| kor | Korean | Hangul | Koreanic | Korean | High |
| kur | Kurdish | Latin | Indo-European | Iranian | Low |
| lao | Lao | Lao | Tai-Kadai | Kam-Tai | Low |
| lav | Latvian | Latin | Indo-European | Balto-Slavic | Mid |
| lat | Latin | Latin | Indo-European | Italic | Mid |
| lit | Lithuanian | Latin | Indo-European | Balto-Slavic | Mid |
| ltz | Luxembourgish | Latin | Indo-European | Germanic | Low |
| mal | Malayalam | Malayalam | Dravidian | South Dravidian | Low |
| mar | Marathi | Devanagari | Indo-European | Indo-Aryan | Low |
| mkd | Macedonian | Cyrillic | Indo-European | Balto-Slavic | Low |
| mlg | Malagasy | Latin | Austronesian | Malayo-Polynesian | Low |
| mlt | Maltese | Latin | Afro-Asiatic | Semitic | Low |
| mon | Mongolian | Cyrillic | Mongolic-Khitan | Mongolic | Low |
| mri | Maori | Latin | Austronesian | Malayo-Polynesian | Low |
| msa | Malay | Latin | Austronesian | Malayo-Polynesian | Mid |
| mya | Burmese | Myanmar | Sino-Tibetan | Burmo-Qiangic | Low |
| nep | Nepali | Devanagari | Indo-European | Indo-Aryan | Low |
| nld | Dutch | Latin | Indo-European | Germanic | High |
| nor | Norwegian | Latin | Indo-European | Germanic | Low |
| nso | Northern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
| nya | Chichewa | Latin | Atlantic-Congo | Benue-Congo | Low |
| ory | Oriya | Oriya | Indo-European | Indo-Aryan | Low |
| pan | Punjabi | Gurmukhi | Indo-European | Indo-Aryan | Low |
| pes | Persian | Arabic | Indo-European | Iranian | High |
| pol | Polish | Latin | Indo-European | Balto-Slavic | High |
| por | Portuguese | Latin | Indo-European | Italic | High |
| pus | Pashto | Arabic | Indo-European | Iranian | Low |
| ron | Romanian | Latin | Indo-European | Italic | Mid |
| rus | Russian | Cyrillic | Indo-European | Balto-Slavic | High |
| sin | Sinhala | Sinhala | Indo-European | Indo-Aryan | Low |
| slk | Slovak | Latin | Indo-European | Balto-Slavic | Mid |
| slv | Slovenian | Latin | Indo-European | Balto-Slavic | Mid |
| smo | Samoan | Latin | Austronesian | Malayo-Polynesian | Low |
| sna | Shona | Latin | Atlantic-Congo | Benue-Congo | Low |
| snd | Sindhi | Arabic | Indo-European | Indo-Aryan | Low |
| som | Somali | Latin | Afro-Asiatic | Cushitic | Low |
| sot | Southern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
| spa | Spanish | Latin | Indo-European | Italic | High |
| sqi | Albanian | Latin | Indo-European | Albanian | Low |
| srp | Serbian | Cyrillic | Indo-European | Balto-Slavic | High |
| sun | Sundanese | Latin | Austronesian | Malayo-Polynesian | Low |
| swa | Swahili | Latin | Atlantic-Congo | Benue-Congo | Low |
| swe | Swedish | Latin | Indo-European | Germanic | High |
| tam | Tamil | Tamil | Dravidian | South Dravidian | Mid |
| tel | Telugu | Telugu | Dravidian | South Dravidian | Low |
| tgk | Tajik | Cyrillic | Indo-European | Iranian | Low |
| tha | Thai | Thai | Tai-Kadai | Kam-Tai | Mid |
| tur | Turkish | Latin | Turkic | Common Turkic | High |
| twi | Twi | Latin | Atlantic-Congo | Niger-Congo | Low |
| ukr | Ukrainian | Cyrillic | Indo-European | Balto-Slavic | Mid |
| urd | Urdu | Arabic | Indo-European | Indo-Aryan | Mid |
| uzb | Uzbek | Latin | Turkic | Common Turkic | Mid |
| vie | Vietnamese | Latin | Austroasiatic | Vietic | High |
| xho | Xhosa | Latin | Atlantic-Congo | Benue-Congo | Low |
| yid | Yiddish | Hebrew | Indo-European | Germanic | Low |
| yor | Yoruba | Latin | Atlantic-Congo | Benue-Congo | Low |
| zho | Chinese | Han | Sino-Tibetan | Sinitic | High |
| zul | Zulu | Latin | Atlantic-Congo | Benue-Congo | Low |
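To work with this table programmatically, for example to count languages per resourcedness level, the rows can be parsed from the Markdown source. The sketch below is only an illustration: `parse_rows` and the `COLUMNS` key names are hypothetical, and `SAMPLE` contains just four rows copied from the table above.

```python
from collections import Counter

# Dictionary keys chosen for this sketch, one per table column.
COLUMNS = ["iso_code", "name", "script", "family", "subgrouping", "resourcedness"]

SAMPLE = """\
| afr | Afrikaans | Latin | Indo-European | Germanic | Mid |
| amh | Amharic | Ge'ez | Afro-Asiatic | Semitic | Low |
| ara | Arabic | Arabic | Afro-Asiatic | Semitic | High |
| cym | Welsh | Latin | Indo-European | Celtic | Low |
"""


def parse_rows(markdown):
    """Turn Markdown table rows into dicts, skipping header and alignment rows."""
    rows = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != len(COLUMNS):
            continue  # blank or malformed line
        if cells[0] == "ISO Code" or set(cells[0]) <= set(":-"):
            continue  # header row or alignment row such as ":-------"
        rows.append(dict(zip(COLUMNS, cells)))
    return rows


rows = parse_rows(SAMPLE)
counts = Counter(r["resourcedness"] for r in rows)
print(dict(counts))
```

Passing the full table text instead of `SAMPLE` gives the counts for all listed languages.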
## Model Card Contact
For errors in this model card, contact Ahmet or Viraat, `{ahmet, viraat} at cohere dot com`.