---
pipeline_tag: translation
license: mit
language:
- multilingual
- af
- am
- ar
- as
- ast
- ay
- az
- ba
- be
- bg
- bn
- br
- bs
- ca
- ceb
- cjk
- cs
- cy
- da
- de
- dyu
- el
- en
- es
- et
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- ilo
- is
- it
- ja
- jv
- ka
- kac
- kam
- kea
- kg
- kk
- km
- kmb
- kmr
- kn
- ko
- ku
- ky
- lb
- lg
- ln
- lo
- lt
- luo
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- no
- ns
- ny
- oc
- om
- or
- pa
- pl
- ps
- pt
- qu
- ro
- ru
- sd
- shn
- si
- sk
- sl
- sn
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- te
- tg
- th
- ti
- tl
- tn
- tr
- uk
- umb
- ur
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
---
# Flores101: Large-Scale Multilingual Machine Translation
`flores101_mm100_175M` is a multilingual encoder-decoder (seq-to-seq) model trained for many-to-many multilingual translation. It was introduced in [this paper](https://aclanthology.org/2022.tacl-1.30) and released in [this repository](https://github.com/facebookresearch/fairseq/tree/main/examples/flores101).
The model architecture and config are the same as the [M2M100](https://huggingface.co/facebook/m2m100_418M) implementation, but the **tokenizer must be modified** after loading so that its language-code mappings match this checkpoint's language tokens.
```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"
model = M2M100ForConditionalGeneration.from_pretrained("seyoungsong/flores101_mm100_175M")
tokenizer: M2M100Tokenizer = M2M100Tokenizer.from_pretrained("seyoungsong/flores101_mm100_175M")
# FIX TOKENIZER! Rebuild the language-code lookup tables from this checkpoint's special tokens.
tokenizer.lang_token_to_id = {t: i for t, i in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids) if i > 5}
tokenizer.lang_code_to_token = {s.strip("_"): s for s in tokenizer.lang_token_to_id}
tokenizer.lang_code_to_id = {s.strip("_"): i for s, i in tokenizer.lang_token_to_id.items()}
tokenizer.id_to_lang_token = {i: s for s, i in tokenizer.lang_token_to_id.items()}
# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."
# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a chocolate box."
```
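The tokenizer fix above rebuilds four lookup tables from the tokenizer's special tokens. As a minimal sketch of that same logic on toy data, the helper below (`build_lang_maps`, a hypothetical name that is not part of `transformers`) applies the same dictionary comprehensions to plain lists. The token strings and IDs are invented for illustration and do not reflect the real vocabulary; the only assumption carried over from the snippet above is that language tokens are the special tokens with an ID greater than 5.

```python
from typing import Dict, List, Tuple


def build_lang_maps(
    special_tokens: List[str], special_ids: List[int], max_reserved_id: int = 5
) -> Tuple[Dict[str, int], Dict[str, str], Dict[str, int], Dict[int, str]]:
    """Rebuild the four language lookup tables (hypothetical helper)."""
    # Keep only special tokens whose ID lies above the reserved range.
    lang_token_to_id = {
        t: i for t, i in zip(special_tokens, special_ids) if i > max_reserved_id
    }
    # "__fr__" -> "fr": strip the surrounding underscores to get the language code.
    lang_code_to_token = {t.strip("_"): t for t in lang_token_to_id}
    lang_code_to_id = {t.strip("_"): i for t, i in lang_token_to_id.items()}
    id_to_lang_token = {i: t for t, i in lang_token_to_id.items()}
    return lang_token_to_id, lang_code_to_token, lang_code_to_id, id_to_lang_token


# Toy data: these IDs are made up and are NOT the model's real token IDs.
toy_tokens = ["<s>", "<pad>", "</s>", "<unk>", "__af__", "__en__", "__fr__"]
toy_ids = [0, 1, 2, 3, 6, 7, 8]

lang_token_to_id, lang_code_to_token, lang_code_to_id, id_to_lang_token = (
    build_lang_maps(toy_tokens, toy_ids)
)
print(lang_code_to_id["fr"])     # 8
print(lang_code_to_token["en"])  # __en__
```

With the real tokenizer, the same tables are assigned to `tokenizer.lang_token_to_id`, `tokenizer.lang_code_to_token`, `tokenizer.lang_code_to_id`, and `tokenizer.id_to_lang_token`, as shown in the usage example.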
## Languages covered
| Language | lang code |
| ---------------- | --------- |
| Afrikaans        | af        |
| Amharic | am |
| Arabic | ar |
| Assamese | as |
| Asturian | ast |
| Aymara | ay |
| Azerbaijani | az |
| Bashkir | ba |
| Belarusian | be |
| Bulgarian | bg |
| Bengali | bn |
| Breton | br |
| Bosnian | bs |
| Catalan | ca |
| Cebuano | ceb |
| Chokwe | cjk |
| Czech | cs |
| Welsh | cy |
| Danish | da |
| German | de |
| Dyula | dyu |
| Greek | el |
| English | en |
| Spanish | es |
| Estonian | et |
| Persian | fa |
| Fulah | ff |
| Finnish | fi |
| French | fr |
| Western Frisian | fy |
| Irish | ga |
| Scottish Gaelic | gd |
| Galician | gl |
| Gujarati | gu |
| Hausa | ha |
| Hebrew | he |
| Hindi | hi |
| Croatian | hr |
| Haitian Creole | ht |
| Hungarian | hu |
| Armenian | hy |
| Indonesian | id |
| Igbo | ig |
| Iloko | ilo |
| Icelandic | is |
| Italian | it |
| Japanese | ja |
| Javanese | jv |
| Georgian | ka |
| Kachin | kac |
| Kamba | kam |
| Kabuverdianu | kea |
| Kongo | kg |
| Kazakh | kk |
| Central Khmer | km |
| Kimbundu | kmb |
| Northern Kurdish | kmr |
| Kannada | kn |
| Korean | ko |
| Kurdish | ku |
| Kyrgyz | ky |
| Luxembourgish | lb |
| Ganda | lg |
| Lingala | ln |
| Lao | lo |
| Lithuanian | lt |
| Luo | luo |
| Latvian | lv |
| Malagasy | mg |
| Maori | mi |
| Macedonian | mk |
| Malayalam | ml |
| Mongolian | mn |
| Marathi | mr |
| Malay | ms |
| Maltese | mt |
| Burmese | my |
| Nepali | ne |
| Dutch | nl |
| Norwegian | no |
| Northern Sotho | ns |
| Nyanja | ny |
| Occitan | oc |
| Oromo | om |
| Oriya | or |
| Punjabi | pa |
| Polish | pl |
| Pashto | ps |
| Portuguese | pt |
| Quechua | qu |
| Romanian | ro |
| Russian | ru |
| Sindhi | sd |
| Shan | shn |
| Sinhala | si |
| Slovak | sk |
| Slovenian | sl |
| Shona | sn |
| Somali | so |
| Albanian | sq |
| Serbian | sr |
| Swati | ss |
| Sundanese | su |
| Swedish | sv |
| Swahili | sw |
| Tamil | ta |
| Telugu | te |
| Tajik | tg |
| Thai | th |
| Tigrinya | ti |
| Tagalog | tl |
| Tswana | tn |
| Turkish | tr |
| Ukrainian | uk |
| Umbundu | umb |
| Urdu | ur |
| Uzbek | uz |
| Vietnamese | vi |
| Wolof | wo |
| Xhosa | xh |
| Yiddish | yi |
| Yoruba | yo |
| Chinese | zh |
| Zulu | zu |