pipeline_tag: translation
license: mit
language:
- multilingual
- af
- am
- ar
- as
- ast
- ay
- az
- ba
- be
- bg
- bn
- br
- bs
- ca
- ceb
- cjk
- cs
- cy
- da
- de
- dyu
- el
- en
- es
- et
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- ilo
- is
- it
- ja
- jv
- ka
- kac
- kam
- kea
- kg
- kk
- km
- kmb
- kmr
- kn
- ko
- ku
- ky
- lb
- lg
- ln
- lo
- lt
- luo
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- no
- ns
- ny
- oc
- om
- or
- pa
- pl
- ps
- pt
- qu
- ro
- ru
- sd
- shn
- si
- sk
- sl
- sn
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- te
- tg
- th
- ti
- tl
- tn
- tr
- uk
- umb
- ur
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
# Flores101: Large-Scale Multilingual Machine Translation
`flores101_mm100_175M` is a multilingual encoder-decoder (seq-to-seq) model trained for many-to-many multilingual translation. It was introduced in this [paper](https://aclanthology.org/2022.tacl-1.30) and released in [this](https://github.com/facebookresearch/fairseq/tree/main/examples/flores101) repository.
The model architecture and config are the same as the [M2M100](https://huggingface.co/facebook/m2m100_418M) implementation, but the **tokenizer must be modified** after loading so that its language-code mappings match this model's language tokens.
```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = M2M100ForConditionalGeneration.from_pretrained("seyoungsong/flores101_mm100_175M")
tokenizer: M2M100Tokenizer = M2M100Tokenizer.from_pretrained("seyoungsong/flores101_mm100_175M")

# FIX TOKENIZER!
# Rebuild the language-code mappings from the special tokens with id > 5.
tokenizer.lang_token_to_id = {t: i for t, i in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids) if i > 5}
tokenizer.lang_code_to_token = {s.strip("_"): s for s in tokenizer.lang_token_to_id}
tokenizer.lang_code_to_id = {s.strip("_"): i for s, i in tokenizer.lang_token_to_id.items()}
tokenizer.id_to_lang_token = {i: s for s, i in tokenizer.lang_token_to_id.items()}

# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."

# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a chocolate box."
```
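To make the tokenizer fix easier to follow, here is a minimal sketch that applies the same dictionary comprehensions to a small, made-up token list instead of the real tokenizer. The token strings and ids below are hypothetical placeholders, and `build_lang_maps` is an illustrative helper, not part of `transformers`; the only assumption carried over from the snippet above is that language tokens look like `__xx__` and have ids greater than 5.

```python
# Hypothetical special-token inventory (ids are made up for illustration).
# On the real tokenizer these come from tokenizer.all_special_tokens
# and tokenizer.all_special_ids.
special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "__af__", "__en__", "__zh__"]
special_ids = [0, 1, 2, 3, 256001, 256002, 256003]


def build_lang_maps(tokens, ids, min_lang_id=5):
    """Rebuild the four language mappings from parallel token/id lists."""
    # Keep only tokens whose id is above the threshold (the language tokens).
    lang_token_to_id = {t: i for t, i in zip(tokens, ids) if i > min_lang_id}
    # "__en__" -> "en": strip the surrounding underscores to get the code.
    lang_code_to_token = {t.strip("_"): t for t in lang_token_to_id}
    lang_code_to_id = {t.strip("_"): i for t, i in lang_token_to_id.items()}
    id_to_lang_token = {i: t for t, i in lang_token_to_id.items()}
    return lang_token_to_id, lang_code_to_token, lang_code_to_id, id_to_lang_token


lang_token_to_id, lang_code_to_token, lang_code_to_id, id_to_lang_token = build_lang_maps(
    special_tokens, special_ids
)

print(lang_code_to_token)  # {'af': '__af__', 'en': '__en__', 'zh': '__zh__'}
print(lang_code_to_id["en"])  # 256002
```

After the fix, a plain language code such as `"en"` resolves to its language token and id, which is what `tokenizer.src_lang` and `tokenizer.get_lang_id` rely on in the translation example.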
## Languages covered
| Language | lang code |
| ---------------- | --------- |
| Afrikaans | af |
| Amharic | am |
| Arabic | ar |
| Assamese | as |
| Asturian | ast |
| Aymara | ay |
| Azerbaijani | az |
| Bashkir | ba |
| Belarusian | be |
| Bulgarian | bg |
| Bengali | bn |
| Breton | br |
| Bosnian | bs |
| Catalan | ca |
| Cebuano | ceb |
| Chokwe | cjk |
| Czech | cs |
| Welsh | cy |
| Danish | da |
| German | de |
| Dyula | dyu |
| Greek | el |
| English | en |
| Spanish | es |
| Estonian | et |
| Persian | fa |
| Fulah | ff |
| Finnish | fi |
| French | fr |
| Western Frisian | fy |
| Irish | ga |
| Scottish Gaelic | gd |
| Galician | gl |
| Gujarati | gu |
| Hausa | ha |
| Hebrew | he |
| Hindi | hi |
| Croatian | hr |
| Haitian Creole | ht |
| Hungarian | hu |
| Armenian | hy |
| Indonesian | id |
| Igbo | ig |
| Iloko | ilo |
| Icelandic | is |
| Italian | it |
| Japanese | ja |
| Javanese | jv |
| Georgian | ka |
| Kachin | kac |
| Kamba | kam |
| Kabuverdianu | kea |
| Kongo | kg |
| Kazakh | kk |
| Central Khmer | km |
| Kimbundu | kmb |
| Northern Kurdish | kmr |
| Kannada | kn |
| Korean | ko |
| Kurdish | ku |
| Kyrgyz | ky |
| Luxembourgish | lb |
| Ganda | lg |
| Lingala | ln |
| Lao | lo |
| Lithuanian | lt |
| Luo | luo |
| Latvian | lv |
| Malagasy | mg |
| Maori | mi |
| Macedonian | mk |
| Malayalam | ml |
| Mongolian | mn |
| Marathi | mr |
| Malay | ms |
| Maltese | mt |
| Burmese | my |
| Nepali | ne |
| Dutch | nl |
| Norwegian | no |
| Northern Sotho | ns |
| Nyanja | ny |
| Occitan | oc |
| Oromo | om |
| Oriya | or |
| Punjabi | pa |
| Polish | pl |
| Pashto | ps |
| Portuguese | pt |
| Quechua | qu |
| Romanian | ro |
| Russian | ru |
| Sindhi | sd |
| Shan | shn |
| Sinhala | si |
| Slovak | sk |
| Slovenian | sl |
| Shona | sn |
| Somali | so |
| Albanian | sq |
| Serbian | sr |
| Swati | ss |
| Sundanese | su |
| Swedish | sv |
| Swahili | sw |
| Tamil | ta |
| Telugu | te |
| Tajik | tg |
| Thai | th |
| Tigrinya | ti |
| Tagalog | tl |
| Tswana | tn |
| Turkish | tr |
| Ukrainian | uk |
| Umbundu | umb |
| Urdu | ur |
| Uzbek | uz |
| Vietnamese | vi |
| Wolof | wo |
| Xhosa | xh |
| Yiddish | yi |
| Yoruba | yo |
| Chinese | zh |
| Zulu | zu |