opus-mt-tc-bible-big-deu_eng_fra_por_spa-fiu

Model Details

Neural machine translation model for translating from German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa) to Finno-Ugrian languages (fiu).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS and training pipelines use the procedures of OPUS-MT-train.

Model Description

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<< (id = a valid target language ID), e.g. >>chm<<.
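
The card does not list the valid target-language IDs directly, but they can usually be recovered from the tokenizer itself. The sketch below scans the vocabulary for entries of the >>xxx<< form; that the language tokens live in the SentencePiece vocabulary is an assumption that holds for typical multilingual OPUS-MT models, not something this card states.

from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained(
    "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-fiu"
)
# Target-language tokens have the form ">>xxx<<"; filter the vocabulary for them.
lang_tokens = sorted(
    tok for tok in tokenizer.get_vocab()
    if tok.startswith(">>") and tok.endswith("<<")
)
print(lang_tokens)  # e.g. ['>>chm<<', '>>est<<', '>>fin<<', ...]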

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>chm<< Replace this with text in an accepted source language.",
    ">>vro<< This is the second sentence."
]

model_name = "pytorch-models/opus-mt-tc-bible-big-deu_eng_fra_por_spa-fiu"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
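
generate() also accepts the usual decoding parameters; the values in this variant are illustrative choices, not settings recommended by the card.

translated = model.generate(
    **tokenizer(src_text, return_tensors="pt", padding=True),
    num_beams=4,         # beam search width; illustrative value
    max_new_tokens=256,  # cap on output length; illustrative value
)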

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-fiu")
print(pipe(">>chm<< Replace this with text in an accepted source language."))
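
The pipeline also accepts a list of inputs, each carrying its own target-language token. The batch_size below is an illustrative choice, not a documented default.

outputs = pipe(
    [
        ">>fin<< Replace this with text in an accepted source language.",
        ">>hun<< This is the second sentence.",
    ],
    batch_size=2,  # illustrative; tune to your hardware
)
print(outputs)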

Training

Evaluation

| langpair | testset | chr-F | BLEU | #sent | #words |
|----------|---------|-------|------|-------|--------|
| deu-est | tatoeba-test-v2021-08-07 | 0.76586 | 57.8 | 244 | 1413 |
| deu-fin | tatoeba-test-v2021-08-07 | 0.64286 | 40.7 | 2647 | 15024 |
| deu-hun | tatoeba-test-v2021-08-07 | 0.57007 | 31.2 | 15342 | 105152 |
| eng-est | tatoeba-test-v2021-08-07 | 0.69134 | 50.6 | 1359 | 7992 |
| eng-fin | tatoeba-test-v2021-08-07 | 0.62482 | 37.6 | 10690 | 65122 |
| eng-hun | tatoeba-test-v2021-08-07 | 0.59750 | 35.9 | 13037 | 79562 |
| fra-fin | tatoeba-test-v2021-08-07 | 0.65723 | 45.0 | 1920 | 9730 |
| fra-hun | tatoeba-test-v2021-08-07 | 0.63096 | 40.6 | 2494 | 13753 |
| por-fin | tatoeba-test-v2021-08-07 | 0.76811 | 58.1 | 477 | 2379 |
| por-hun | tatoeba-test-v2021-08-07 | 0.64930 | 42.5 | 2500 | 14063 |
| spa-fin | tatoeba-test-v2021-08-07 | 0.66220 | 43.4 | 2513 | 14131 |
| spa-hun | tatoeba-test-v2021-08-07 | 0.63596 | 42.0 | 2500 | 14599 |
| eng-fin | flores101-devtest | 0.57265 | 21.9 | 1012 | 18781 |
| fra-hun | flores101-devtest | 0.52691 | 21.2 | 1012 | 22183 |
| por-fin | flores101-devtest | 0.53772 | 18.6 | 1012 | 18781 |
| por-hun | flores101-devtest | 0.53275 | 21.8 | 1012 | 22183 |
| spa-est | flores101-devtest | 0.50142 | 15.2 | 1012 | 19788 |
| spa-fin | flores101-devtest | 0.50401 | 13.7 | 1012 | 18781 |
| deu-est | flores200-devtest | 0.55333 | 21.2 | 1012 | 19788 |
| deu-fin | flores200-devtest | 0.54020 | 18.3 | 1012 | 18781 |
| deu-hun | flores200-devtest | 0.53579 | 22.0 | 1012 | 22183 |
| eng-est | flores200-devtest | 0.59496 | 26.1 | 1012 | 19788 |
| eng-fin | flores200-devtest | 0.57811 | 23.1 | 1012 | 18781 |
| eng-hun | flores200-devtest | 0.57670 | 26.7 | 1012 | 22183 |
| fra-est | flores200-devtest | 0.54442 | 21.2 | 1012 | 19788 |
| fra-fin | flores200-devtest | 0.53768 | 18.5 | 1012 | 18781 |
| fra-hun | flores200-devtest | 0.52691 | 21.2 | 1012 | 22183 |
| por-est | flores200-devtest | 0.48227 | 15.6 | 1012 | 19788 |
| por-fin | flores200-devtest | 0.53772 | 18.6 | 1012 | 18781 |
| por-hun | flores200-devtest | 0.53275 | 21.8 | 1012 | 22183 |
| spa-est | flores200-devtest | 0.50142 | 15.2 | 1012 | 19788 |
| spa-fin | flores200-devtest | 0.50401 | 13.7 | 1012 | 18781 |
| spa-hun | flores200-devtest | 0.49444 | 16.4 | 1012 | 22183 |
| deu-hun | newssyscomb2009 | 0.49607 | 18.1 | 502 | 9733 |
| eng-hun | newssyscomb2009 | 0.50580 | 18.3 | 502 | 9733 |
| fra-hun | newssyscomb2009 | 0.49415 | 17.8 | 502 | 9733 |
| spa-hun | newssyscomb2009 | 0.48559 | 16.9 | 502 | 9733 |
| deu-hun | newstest2008 | 0.48855 | 17.2 | 2051 | 41875 |
| eng-hun | newstest2008 | 0.47636 | 15.9 | 2051 | 41875 |
| fra-hun | newstest2008 | 0.48598 | 17.7 | 2051 | 41875 |
| spa-hun | newstest2008 | 0.47888 | 17.1 | 2051 | 41875 |
| deu-hun | newstest2009 | 0.48692 | 18.1 | 2525 | 54965 |
| eng-hun | newstest2009 | 0.49507 | 18.4 | 2525 | 54965 |
| fra-hun | newstest2009 | 0.48961 | 18.6 | 2525 | 54965 |
| spa-hun | newstest2009 | 0.48496 | 18.1 | 2525 | 54965 |
| eng-fin | newstest2015 | 0.56896 | 22.8 | 1370 | 19735 |
| eng-fin | newstest2016 | 0.57934 | 24.3 | 3000 | 47678 |
| eng-fin | newstest2017 | 0.60204 | 26.5 | 3002 | 45269 |
| eng-est | newstest2018 | 0.56276 | 23.8 | 2000 | 36269 |
| eng-fin | newstest2018 | 0.52953 | 17.4 | 3000 | 44836 |
| eng-fin | newstest2019 | 0.55882 | 24.2 | 1997 | 38369 |
| eng-fin | newstestALL2016 | 0.57934 | 24.3 | 3000 | 47678 |
| eng-fin | newstestALL2017 | 0.60204 | 26.5 | 3002 | 45269 |
| eng-fin | newstestB2016 | 0.54388 | 19.9 | 3000 | 45766 |
| eng-fin | newstestB2017 | 0.56369 | 22.6 | 3002 | 45506 |
| deu-est | ntrex128 | 0.51761 | 18.6 | 1997 | 38420 |
| deu-fin | ntrex128 | 0.50759 | 15.5 | 1997 | 35701 |
| deu-hun | ntrex128 | 0.46171 | 15.6 | 1997 | 44462 |
| eng-est | ntrex128 | 0.57099 | 24.4 | 1997 | 38420 |
| eng-fin | ntrex128 | 0.53413 | 18.5 | 1997 | 35701 |
| eng-hun | ntrex128 | 0.47342 | 16.6 | 1997 | 44462 |
| fra-est | ntrex128 | 0.50712 | 17.7 | 1997 | 38420 |
| fra-fin | ntrex128 | 0.49215 | 14.2 | 1997 | 35701 |
| fra-hun | ntrex128 | 0.44873 | 14.9 | 1997 | 44462 |
| por-est | ntrex128 | 0.48098 | 15.1 | 1997 | 38420 |
| por-fin | ntrex128 | 0.50875 | 15.0 | 1997 | 35701 |
| por-hun | ntrex128 | 0.45817 | 15.5 | 1997 | 44462 |
| spa-est | ntrex128 | 0.52158 | 18.5 | 1997 | 38420 |
| spa-fin | ntrex128 | 0.50947 | 15.2 | 1997 | 35701 |
| spa-hun | ntrex128 | 0.46051 | 16.1 | 1997 | 44462 |
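
The chr-F and BLEU columns follow standard MT evaluation practice. To score your own outputs the same way, a sketch along these lines works, assuming the sacrebleu package; hyp.txt and ref.txt are hypothetical one-sentence-per-line files.

from sacrebleu.metrics import BLEU, CHRF

# Hypothetical files: system output and reference translations, one sentence per line.
with open("hyp.txt") as f:
    hyps = f.read().splitlines()
with open("ref.txt") as f:
    refs = f.read().splitlines()

print(BLEU().corpus_score(hyps, [refs]))  # BLEU, as in the table above
print(CHRF().corpus_score(hyps, [refs]))  # chr-F, as in the table above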

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  volume={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 09:01:19 EEST 2024
  • port machine: LM0-400-22516.local