mBART-50

mBART-50 is a multilingual sequence-to-sequence model pre-trained with the "Multilingual Denoising Pretraining" objective. It was introduced in the paper Multilingual Translation with Extensible Multilingual Pretraining and Finetuning.

Model description

mBART-50 was introduced to show that multilingual translation models can be created through multilingual fine-tuning: instead of fine-tuning on one translation direction, a pre-trained model is fine-tuned on many directions simultaneously. mBART-50 starts from the original mBART model and extends it with 25 additional languages, so that it supports multilingual machine translation across 50 languages. The pre-training objective is explained below.

Multilingual Denoising Pretraining: The model incorporates N languages by concatenating data: D = {D1, ..., DN}, where each Di is a collection of monolingual documents in language i. The source documents are noised using two schemes: first, the order of the original sentences is randomly shuffled; second, a novel in-filling scheme replaces spans of text with a single mask token. The model is then tasked with reconstructing the original text. 35% of each instance's words are masked, with span lengths randomly sampled from a Poisson distribution (λ = 3.5). The decoder input is the original text offset by one position. A language id symbol LID is used as the initial token to predict the sentence.
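The two noising schemes can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the authors' implementation: it works on whitespace-separated words rather than subword tokens, forces every span to cover at least one word, and lets overlapping spans merge. The names `noise_document` and `sample_poisson`, and the use of "</s>" as the end-of-sequence symbol, are hypothetical choices made for this example.

```python
import math
import random

MASK = "<mask>"
EOS = "</s>"  # assumed end-of-sequence symbol, for illustration only


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def noise_document(sentences, lang_id, mask_ratio=0.35, lam=3.5, seed=0):
    """Return a noised source plus the shifted decoder input and labels."""
    rng = random.Random(seed)
    original = " ".join(sentences).split()

    # Scheme 1: randomly shuffle the order of the sentences.
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    words = " ".join(shuffled).split()

    # Scheme 2: in-filling. Mask spans until mask_ratio of the words is hidden.
    budget = round(mask_ratio * len(words))
    masked = [False] * len(words)
    n_masked = 0
    while n_masked < budget:
        # Simplification: spans of length 0 are bumped to length 1.
        span = max(1, sample_poisson(lam, rng))
        span = min(span, budget - n_masked)
        # Starting on an unmasked word guarantees progress on every draw.
        start = rng.choice([i for i, m in enumerate(masked) if not m])
        for i in range(start, min(start + span, len(words))):
            if not masked[i]:
                masked[i] = True
                n_masked += 1

    # Each run of masked words collapses into a single mask token.
    source = []
    for word, hidden in zip(words, masked):
        if not hidden:
            source.append(word)
        elif not source or source[-1] != MASK:
            source.append(MASK)

    # The decoder sees the original text shifted by one position,
    # with the language id as its first token.
    return {
        "source": source,
        "decoder_input": [lang_id] + original,
        "labels": original + [EOS],
    }


if __name__ == "__main__":
    doc = [
        "UN chief says there is no military solution in Syria .",
        "The talks resume next week .",
        "Both sides agreed to attend .",
    ]
    example = noise_document(doc, "en_XX", seed=1)
    print(" ".join(example["source"]))
    print(" ".join(example["decoder_input"]))
```

Because a whole span becomes one mask token, the model has to infer how many words are missing as well as which ones, which is what distinguishes in-filling from masking words one at a time.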

Checking against the original model

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Reference: the original mBART-50 checkpoint and tokenizer.
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/mbart-large-50')
tokenizer = AutoTokenizer.from_pretrained('facebook/mbart-large-50')

src_text = "UN Chief Says There Is <mask> Military Solution <mask> Syria"
encoded = tokenizer(src_text, return_tensors="pt")
# Force English as the first generated token and keep the hidden states for comparison.
generated_output = model.generate(**encoded, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
                                  return_dict_in_generate=True, output_hidden_states=True)
text_output = tokenizer.batch_decode(generated_output.sequences, skip_special_tokens=True)

# Same input, same generation settings, run through this model and its tokenizer.
new_model = AutoModelForSeq2SeqLM.from_pretrained('nguyenvulebinh/mbart-large-50-latin-only')
new_tokenizer = AutoTokenizer.from_pretrained('nguyenvulebinh/mbart-large-50-latin-only')
new_encoded = new_tokenizer(src_text, return_tensors="pt")
new_generated_output = new_model.generate(**new_encoded, forced_bos_token_id=new_tokenizer.lang_code_to_id["en_XX"],
                                          return_dict_in_generate=True, output_hidden_states=True)
new_text_output = new_tokenizer.batch_decode(new_generated_output.sequences, skip_special_tokens=True)

# The decoded text must match, and so must the last encoder layer
# and the last decoder layer at the final generation step.
assert text_output == new_text_output
assert torch.equal(generated_output.encoder_hidden_states[-1], new_generated_output.encoder_hidden_states[-1])
assert torch.equal(generated_output.decoder_hidden_states[-1][-1], new_generated_output.decoder_hidden_states[-1][-1])

print(new_text_output)
# ['UN Chief Says There Is  No Military Solution  to the War in Syria']

Languages covered

English (en_XX)

BibTeX entry and citation info

@article{tang2020multilingual,
    title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
    author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
    year={2020},
    eprint={2008.00401},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}