---
pipeline_tag: translation
license: mit
language:
- multilingual
- af
- am
- ar
- as
- ast
- ay
- az
- ba
- be
- bg
- bn
- br
- bs
- ca
- ceb
- cjk
- cs
- cy
- da
- de
- dyu
- el
- en
- es
- et
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- ilo
- is
- it
- ja
- jv
- ka
- kac
- kam
- kea
- kg
- kk
- km
- kmb
- kmr
- kn
- ko
- ku
- ky
- lb
- lg
- ln
- lo
- lt
- luo
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- no
- ns
- ny
- oc
- om
- or
- pa
- pl
- ps
- pt
- qu
- ro
- ru
- sd
- shn
- si
- sk
- sl
- sn
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- te
- tg
- th
- ti
- tl
- tn
- tr
- uk
- umb
- ur
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
---

# Flores101: Large-Scale Multilingual Machine Translation

`flores101_mm100_175M` is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. It was introduced in this [paper](https://aclanthology.org/2022.tacl-1.30) and released in [this](https://github.com/facebookresearch/fairseq/tree/main/examples/flores101) repository.

The model architecture and config are the same as the [M2M100](https://huggingface.co/facebook/m2m100_418M) implementation, but the **tokenizer should be modified** to adjust the language codes.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = M2M100ForConditionalGeneration.from_pretrained("seyoungsong/flores101_mm100_175M")
tokenizer: M2M100Tokenizer = M2M100Tokenizer.from_pretrained("seyoungsong/flores101_mm100_175M")

# FIX TOKENIZER!
# Rebuild the language lookup tables from the special tokens whose ID is above 5.
tokenizer.lang_token_to_id = {t: i for t, i in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids) if i > 5}
tokenizer.lang_code_to_token = {s.strip("_"): s for s in tokenizer.lang_token_to_id}
tokenizer.lang_code_to_id = {s.strip("_"): i for s, i in tokenizer.lang_token_to_id.items()}
tokenizer.id_to_lang_token = {i: s for s, i in tokenizer.lang_token_to_id.items()}

# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."

# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a chocolate box."
```

## Languages covered

| Language         | lang code |
| ---------------- | --------- |
| Afrikaans        | af        |
| Amharic          | am        |
| Arabic           | ar        |
| Assamese         | as        |
| Asturian         | ast       |
| Aymara           | ay        |
| Azerbaijani      | az        |
| Bashkir          | ba        |
| Belarusian       | be        |
| Bulgarian        | bg        |
| Bengali          | bn        |
| Breton           | br        |
| Bosnian          | bs        |
| Catalan          | ca        |
| Cebuano          | ceb       |
| Chokwe           | cjk       |
| Czech            | cs        |
| Welsh            | cy        |
| Danish           | da        |
| German           | de        |
| Dyula            | dyu       |
| Greek            | el        |
| English          | en        |
| Spanish          | es        |
| Estonian         | et        |
| Persian          | fa        |
| Fulah            | ff        |
| Finnish          | fi        |
| French           | fr        |
| Western Frisian  | fy        |
| Irish            | ga        |
| Scottish Gaelic  | gd        |
| Galician         | gl        |
| Gujarati         | gu        |
| Hausa            | ha        |
| Hebrew           | he        |
| Hindi            | hi        |
| Croatian         | hr        |
| Haitian Creole   | ht        |
| Hungarian        | hu        |
| Armenian         | hy        |
| Indonesian       | id        |
| Igbo             | ig        |
| Iloko            | ilo       |
| Icelandic        | is        |
| Italian          | it        |
| Japanese         | ja        |
| Javanese         | jv        |
| Georgian         | ka        |
| Kachin           | kac       |
| Kamba            | kam       |
| Kabuverdianu     | kea       |
| Kongo            | kg        |
| Kazakh           | kk        |
| Central Khmer    | km        |
| Kimbundu         | kmb       |
| Northern Kurdish | kmr       |
| Kannada          | kn        |
| Korean           | ko        |
| Kurdish          | ku        |
| Kyrgyz           | ky        |
| Luxembourgish    | lb        |
| Ganda            | lg        |
| Lingala          | ln        |
| Lao              | lo        |
| Lithuanian       | lt        |
| Luo              | luo       |
| Latvian          | lv        |
| Malagasy         | mg        |
| Maori            | mi        |
| Macedonian       | mk        |
| Malayalam        | ml        |
| Mongolian        | mn        |
| Marathi          | mr        |
| Malay            | ms        |
| Maltese          | mt        |
| Burmese          | my        |
| Nepali           | ne        |
| Dutch            | nl        |
| Norwegian        | no        |
| Northern Sotho   | ns        |
| Nyanja           | ny        |
| Occitan          | oc        |
| Oromo            | om        |
| Oriya            | or        |
| Punjabi          | pa        |
| Polish           | pl        |
| Pashto           | ps        |
| Portuguese       | pt        |
| Quechua          | qu        |
| Romanian         | ro        |
| Russian          | ru        |
| Sindhi           | sd        |
| Shan             | shn       |
| Sinhala          | si        |
| Slovak           | sk        |
| Slovenian        | sl        |
| Shona            | sn        |
| Somali           | so        |
| Albanian         | sq        |
| Serbian          | sr        |
| Swati            | ss        |
| Sundanese        | su        |
| Swedish          | sv        |
| Swahili          | sw        |
| Tamil            | ta        |
| Telugu           | te        |
| Tajik            | tg        |
| Thai             | th        |
| Tigrinya         | ti        |
| Tagalog          | tl        |
| Tswana           | tn        |
| Turkish          | tr        |
| Ukrainian        | uk        |
| Umbundu          | umb       |
| Urdu             | ur        |
| Uzbek            | uz        |
| Vietnamese       | vi        |
| Wolof            | wo        |
| Xhosa            | xh        |
| Yiddish          | yi        |
| Yoruba           | yo        |
| Chinese          | zh        |
| Zulu             | zu        |
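
## Tokenizer fix on a toy vocabulary

The tokenizer fix in the usage example derives four lookup tables from the tokenizer's special tokens. The sketch below reproduces the same dictionary comprehensions on a small stand-in vocabulary, so the result can be inspected without downloading the model. The helper name `build_lang_maps`, the toy tokens, and their IDs are hypothetical; the `__xx__` token format is an assumption inferred from the `strip("_")` call, and the real tokenizer's IDs will differ.

```python
from typing import Dict, Iterable, Tuple

LangMaps = Tuple[Dict[str, int], Dict[str, str], Dict[str, int], Dict[int, str]]


def build_lang_maps(
    special_tokens: Iterable[str],
    special_ids: Iterable[int],
    min_id: int = 5,
) -> LangMaps:
    """Hypothetical helper mirroring the tokenizer fix from the usage example."""
    # Keep only special tokens whose ID is above min_id (the example uses 5).
    lang_token_to_id = {
        t: i for t, i in zip(special_tokens, special_ids) if i > min_id
    }
    # Strip the surrounding underscores to get the bare language code.
    lang_code_to_token = {t.strip("_"): t for t in lang_token_to_id}
    lang_code_to_id = {t.strip("_"): i for t, i in lang_token_to_id.items()}
    # Reverse mapping from ID back to the language token.
    id_to_lang_token = {i: t for t, i in lang_token_to_id.items()}
    return lang_token_to_id, lang_code_to_token, lang_code_to_id, id_to_lang_token


# Toy data for illustration only: these tokens and IDs are made up.
toy_tokens = ["<s>", "<pad>", "</s>", "<unk>", "__af__", "__fr__", "__zh__"]
toy_ids = [0, 1, 2, 3, 100, 101, 102]

token_to_id, code_to_token, code_to_id, id_to_token = build_lang_maps(toy_tokens, toy_ids)
print(code_to_id)     # {'af': 100, 'fr': 101, 'zh': 102}
print(code_to_token)  # {'af': '__af__', 'fr': '__fr__', 'zh': '__zh__'}
```

In this toy run the four low-ID tokens are filtered out and only the language tokens remain, which is the effect the `i > 5` condition is relied on to have in the usage example.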