opus-mt-tc-big-itc-itc

Model Details

Neural machine translation model for translating from Italic languages (itc) to Italic languages (itc).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the Marian NMT framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and training pipelines use the procedures of OPUS-MT-train.

Model Description:

  • Developed by: Language Technology Research Group at the University of Helsinki
  • Model Type: Translation (transformer-big)
  • Release: 2022-08-10
  • License: CC-BY-4.0
  • Language(s):
    • Source Language(s): ast cat cbk fra fro glg hat ita lad lad_Latn lat lat_Latn lij lld oci pms por ron spa
    • Target Language(s): ast cat fra gcf glg hat ita lad lad_Latn lat lat_Latn oci por ron spa
    • Language Pair(s): ast-cat ast-fra ast-glg ast-ita ast-oci ast-por ast-ron ast-spa cat-ast cat-fra cat-glg cat-ita cat-oci cat-por cat-ron cat-spa fra-ast fra-cat fra-glg fra-ita fra-oci fra-por fra-ron fra-spa glg-ast glg-cat glg-fra glg-ita glg-oci glg-por glg-ron glg-spa ita-ast ita-cat ita-fra ita-glg ita-oci ita-por ita-ron ita-spa lad-spa lad_Latn-spa oci-ast oci-cat oci-fra oci-glg oci-ita oci-por oci-ron oci-spa pms-ita por-ast por-cat por-fra por-glg por-ita por-oci por-ron por-spa ron-ast ron-cat ron-fra ron-glg ron-ita ron-oci ron-por ron-spa spa-cat spa-fra spa-glg spa-ita spa-por spa-ron
    • Valid Target Language Labels: >>acf<< >>aoa<< >>arg<< >>ast<< >>cat<< >>cbk<< >>cbk_Latn<< >>ccd<< >>cks<< >>cos<< >>cri<< >>crs<< >>dlm<< >>drc<< >>egl<< >>ext<< >>fab<< >>fax<< >>fra<< >>frc<< >>frm<< >>frm_Latn<< >>fro<< >>fro_Latn<< >>frp<< >>fur<< >>fur_Latn<< >>gcf<< >>gcf_Latn<< >>gcr<< >>glg<< >>hat<< >>idb<< >>ist<< >>ita<< >>itk<< >>kea<< >>kmv<< >>lad<< >>lad_Latn<< >>lat<< >>lat_Grek<< >>lat_Latn<< >>lij<< >>lld<< >>lld_Latn<< >>lmo<< >>lou<< >>mcm<< >>mfe<< >>mol<< >>mwl<< >>mxi<< >>mzs<< >>nap<< >>nrf<< >>oci<< >>osc<< >>osp<< >>osp_Latn<< >>pap<< >>pcd<< >>pln<< >>pms<< >>pob<< >>por<< >>pov<< >>pre<< >>pro<< >>qbb<< >>qhr<< >>rcf<< >>rgn<< >>roh<< >>ron<< >>ruo<< >>rup<< >>ruq<< >>scf<< >>scn<< >>sdc<< >>sdn<< >>spa<< >>spq<< >>spx<< >>src<< >>srd<< >>sro<< >>tmg<< >>tvy<< >>vec<< >>vkp<< >>wln<< >>xfa<< >>xum<<
  • Original Model: opusTCv20210807_transformer-big_2022-08-10.zip

This is a multilingual translation model with multiple target languages. A sentence-initial language token of the form >>id<< is required, where id is a valid target language ID, e.g. >>ast<<.
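As a minimal sketch of the >>id<< token convention, the helper below prepends a target-language token to a source sentence. The name `add_target_token` and the small `VALID_TARGETS` set are illustrative assumptions, not part of the model's API; the set holds only a handful of the labels listed under Valid Target Language Labels.

```python
# Illustrative subset of the valid target language labels listed in this card.
VALID_TARGETS = {"ast", "cat", "fra", "glg", "ita", "oci", "por", "ron", "spa"}


def add_target_token(text: str, target: str) -> str:
    """Prepend the >>id<< token that selects the target language."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unknown target language label: {target!r}")
    # The token goes at the very start of the sentence, followed by a space.
    return f">>{target}<< {text.strip()}"


print(add_target_token("Vull veure't.", "fra"))  # >>fra<< Vull veure't.
```

Rejecting unknown labels up front is a design choice for this sketch; it makes a mistyped label fail loudly before the text reaches the model.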

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>fra<< Charras anglés?",
    ">>fra<< Vull veure't."
]

# Hub ID of the model (the same ID used in the pipeline example)
model_name = "Helsinki-NLP/opus-mt-tc-big-itc-itc"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch (padded to equal length) and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Conversations anglaises ?
#     Je veux te voir.

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-itc-itc")
print(pipe(">>fra<< Charras anglés?"))

# expected output: Conversations anglaises ?
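Because the target language is chosen per sentence by its token, one source sentence can be sent to several target languages in a single batch. The sketch below illustrates this under stated assumptions: `translate_to_many` and `pipeline_translate_fn` are hypothetical helper names, and the translation backend is passed in as a function so the batching logic can be tried without downloading the model.

```python
from typing import Callable, Dict, List


def translate_to_many(
    text: str,
    targets: List[str],
    translate_fn: Callable[[List[str]], List[str]],
) -> Dict[str, str]:
    """Translate one sentence into several target languages in one batch."""
    # One tagged copy of the sentence per requested target language.
    tagged = [f">>{target}<< {text}" for target in targets]
    outputs = translate_fn(tagged)
    return dict(zip(targets, outputs))


def pipeline_translate_fn(model_name: str = "Helsinki-NLP/opus-mt-tc-big-itc-itc"):
    """Build a translate_fn backed by a transformers translation pipeline."""
    # Imported lazily; the model is downloaded the first time this runs.
    from transformers import pipeline

    pipe = pipeline("translation", model=model_name)
    return lambda batch: [result["translation_text"] for result in pipe(batch)]
```

With transformers installed, `translate_to_many("Vull veure't.", ["fra", "spa"], pipeline_translate_fn())` should return a dictionary keyed by target language label.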

Training

Evaluation

langpair testset chr-F BLEU #sent #words
cat-fra tatoeba-test-v2021-08-07 0.71201 54.6 700 5664
cat-ita tatoeba-test-v2021-08-07 0.74198 58.4 298 2028
cat-por tatoeba-test-v2021-08-07 0.74930 57.4 747 6119
cat-spa tatoeba-test-v2021-08-07 0.87844 78.1 1534 12094
fra-cat tatoeba-test-v2021-08-07 0.66525 46.2 700 5342
fra-ita tatoeba-test-v2021-08-07 0.72742 53.8 10091 62060
fra-por tatoeba-test-v2021-08-07 0.68413 48.6 10518 77650
fra-ron tatoeba-test-v2021-08-07 0.65009 44.0 1925 12252
fra-spa tatoeba-test-v2021-08-07 0.72080 54.8 10294 78406
glg-por tatoeba-test-v2021-08-07 0.76720 61.1 433 3105
glg-spa tatoeba-test-v2021-08-07 0.82362 71.7 2121 17443
ita-cat tatoeba-test-v2021-08-07 0.72529 56.4 298 2109
ita-fra tatoeba-test-v2021-08-07 0.77932 65.2 10091 66377
ita-por tatoeba-test-v2021-08-07 0.72798 54.0 3066 25668
ita-ron tatoeba-test-v2021-08-07 0.70814 51.1 1005 6209
ita-spa tatoeba-test-v2021-08-07 0.77455 62.9 5000 34937
lad_Latn-spa tatoeba-test-v2021-08-07 0.59363 42.6 239 1239
lad-spa tatoeba-test-v2021-08-07 0.52243 34.7 276 1448
oci-fra tatoeba-test-v2021-08-07 0.49660 29.6 806 6302
pms-ita tatoeba-test-v2021-08-07 0.40221 20.0 232 1721
por-cat tatoeba-test-v2021-08-07 0.71146 52.2 747 6149
por-fra tatoeba-test-v2021-08-07 0.75565 60.9 10518 80459
por-glg tatoeba-test-v2021-08-07 0.75348 59.0 433 3016
por-ita tatoeba-test-v2021-08-07 0.76883 58.8 3066 24897
por-ron tatoeba-test-v2021-08-07 0.67838 46.6 681 4521
por-spa tatoeba-test-v2021-08-07 0.79336 64.8 10947 87335
ron-fra tatoeba-test-v2021-08-07 0.70307 55.0 1925 13347
ron-ita tatoeba-test-v2021-08-07 0.73862 53.7 1005 6352
ron-por tatoeba-test-v2021-08-07 0.70889 50.7 681 4593
ron-spa tatoeba-test-v2021-08-07 0.73529 57.2 1959 12679
spa-cat tatoeba-test-v2021-08-07 0.82758 67.9 1534 12343
spa-fra tatoeba-test-v2021-08-07 0.73113 57.3 10294 83501
spa-glg tatoeba-test-v2021-08-07 0.77332 63.0 2121 16581
spa-ita tatoeba-test-v2021-08-07 0.77046 60.3 5000 34515
spa-lad_Latn tatoeba-test-v2021-08-07 0.40084 14.7 239 1254
spa-por tatoeba-test-v2021-08-07 0.75854 59.1 10947 87610
spa-ron tatoeba-test-v2021-08-07 0.66679 45.5 1959 12503
ast-cat flores101-devtest 0.57870 31.8 1012 27304
ast-fra flores101-devtest 0.56761 31.1 1012 28343
ast-glg flores101-devtest 0.55161 27.9 1012 26582
ast-ita flores101-devtest 0.51764 22.1 1012 27306
ast-oci flores101-devtest 0.49545 20.6 1012 27305
ast-por flores101-devtest 0.57347 31.5 1012 26519
ast-ron flores101-devtest 0.52317 24.8 1012 26799
ast-spa flores101-devtest 0.49741 21.2 1012 29199
cat-ast flores101-devtest 0.56754 24.7 1012 24572
cat-fra flores101-devtest 0.63368 38.4 1012 28343
cat-glg flores101-devtest 0.59596 32.2 1012 26582
cat-ita flores101-devtest 0.55886 26.3 1012 27306
cat-oci flores101-devtest 0.54285 24.6 1012 27305
cat-por flores101-devtest 0.62913 37.7 1012 26519
cat-ron flores101-devtest 0.56885 29.5 1012 26799
cat-spa flores101-devtest 0.53372 24.6 1012 29199
fra-ast flores101-devtest 0.52696 20.7 1012 24572
fra-cat flores101-devtest 0.60492 34.6 1012 27304
fra-glg flores101-devtest 0.57485 30.3 1012 26582
fra-ita flores101-devtest 0.56493 27.3 1012 27306
fra-oci flores101-devtest 0.57449 28.2 1012 27305
fra-por flores101-devtest 0.62211 36.9 1012 26519
fra-ron flores101-devtest 0.56998 29.4 1012 26799
fra-spa flores101-devtest 0.52880 24.2 1012 29199
glg-ast flores101-devtest 0.55090 22.4 1012 24572
glg-cat flores101-devtest 0.60550 32.6 1012 27304
glg-fra flores101-devtest 0.62026 36.0 1012 28343
glg-ita flores101-devtest 0.55834 25.9 1012 27306
glg-oci flores101-devtest 0.52520 21.9 1012 27305
glg-por flores101-devtest 0.60027 32.7 1012 26519
glg-ron flores101-devtest 0.55621 27.8 1012 26799
glg-spa flores101-devtest 0.53219 24.4 1012 29199
ita-ast flores101-devtest 0.50741 17.1 1012 24572
ita-cat flores101-devtest 0.57061 27.9 1012 27304
ita-fra flores101-devtest 0.60199 32.0 1012 28343
ita-glg flores101-devtest 0.55312 25.9 1012 26582
ita-oci flores101-devtest 0.48447 18.1 1012 27305
ita-por flores101-devtest 0.58162 29.0 1012 26519
ita-ron flores101-devtest 0.53703 24.2 1012 26799
ita-spa flores101-devtest 0.52238 23.1 1012 29199
oci-ast flores101-devtest 0.53010 20.2 1012 24572
oci-cat flores101-devtest 0.59946 32.2 1012 27304
oci-fra flores101-devtest 0.64290 39.0 1012 28343
oci-glg flores101-devtest 0.56737 28.0 1012 26582
oci-ita flores101-devtest 0.54220 24.2 1012 27306
oci-por flores101-devtest 0.62127 35.7 1012 26519
oci-ron flores101-devtest 0.55906 28.0 1012 26799
oci-spa flores101-devtest 0.52110 22.8 1012 29199
por-ast flores101-devtest 0.54539 22.5 1012 24572
por-cat flores101-devtest 0.61809 36.4 1012 27304
por-fra flores101-devtest 0.64343 39.7 1012 28343
por-glg flores101-devtest 0.57965 30.4 1012 26582
por-ita flores101-devtest 0.55841 26.3 1012 27306
por-oci flores101-devtest 0.54829 25.3 1012 27305
por-ron flores101-devtest 0.57283 29.8 1012 26799
por-spa flores101-devtest 0.53513 25.2 1012 29199
ron-ast flores101-devtest 0.52265 20.1 1012 24572
ron-cat flores101-devtest 0.59689 32.6 1012 27304
ron-fra flores101-devtest 0.63060 37.4 1012 28343
ron-glg flores101-devtest 0.56677 29.3 1012 26582
ron-ita flores101-devtest 0.55485 25.6 1012 27306
ron-oci flores101-devtest 0.52433 21.8 1012 27305
ron-por flores101-devtest 0.61831 36.1 1012 26519
ron-spa flores101-devtest 0.52712 24.1 1012 29199
spa-ast flores101-devtest 0.49008 15.7 1012 24572
spa-cat flores101-devtest 0.53905 23.2 1012 27304
spa-fra flores101-devtest 0.57078 27.4 1012 28343
spa-glg flores101-devtest 0.52563 22.0 1012 26582
spa-ita flores101-devtest 0.52783 22.3 1012 27306
spa-oci flores101-devtest 0.48064 16.3 1012 27305
spa-por flores101-devtest 0.55736 25.8 1012 26519
spa-ron flores101-devtest 0.51623 21.4 1012 26799
fra-ita newssyscomb2009 0.60995 32.1 502 11551
fra-spa newssyscomb2009 0.60224 34.2 502 12503
ita-fra newssyscomb2009 0.61237 33.7 502 12331
ita-spa newssyscomb2009 0.60706 35.4 502 12503
spa-fra newssyscomb2009 0.61290 34.6 502 12331
spa-ita newssyscomb2009 0.61632 33.3 502 11551
fra-spa news-test2008 0.58939 33.9 2051 52586
spa-fra news-test2008 0.58695 32.4 2051 52685
fra-ita newstest2009 0.59764 31.2 2525 63466
fra-spa newstest2009 0.58829 32.5 2525 68111
ita-fra newstest2009 0.59084 31.6 2525 69263
ita-spa newstest2009 0.59669 33.5 2525 68111
spa-fra newstest2009 0.59096 32.3 2525 69263
spa-ita newstest2009 0.60783 33.2 2525 63466
fra-spa newstest2010 0.62250 37.8 2489 65480
spa-fra newstest2010 0.61953 36.2 2489 66022
fra-spa newstest2011 0.62953 39.8 3003 79476
spa-fra newstest2011 0.61130 34.9 3003 80626
fra-spa newstest2012 0.62397 39.0 3003 79006
spa-fra newstest2012 0.60927 34.3 3003 78011
fra-spa newstest2013 0.59312 34.9 3000 70528
spa-fra newstest2013 0.59468 33.6 3000 70037
cat-ita wmt21-ml-wp 0.69968 47.8 1743 42735
cat-oci wmt21-ml-wp 0.73808 51.6 1743 43736
cat-ron wmt21-ml-wp 0.51178 29.0 1743 42895
ita-cat wmt21-ml-wp 0.70538 48.9 1743 43833
ita-oci wmt21-ml-wp 0.59025 32.0 1743 43736
ita-ron wmt21-ml-wp 0.51261 28.9 1743 42895
oci-cat wmt21-ml-wp 0.80908 66.1 1743 43833
oci-ita wmt21-ml-wp 0.63584 39.6 1743 42735
oci-ron wmt21-ml-wp 0.47384 24.6 1743 42895
ron-cat wmt21-ml-wp 0.52994 31.1 1743 43833
ron-ita wmt21-ml-wp 0.52714 29.6 1743 42735
ron-oci wmt21-ml-wp 0.45932 21.3 1743 43736
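The rows above are whitespace-separated, so they are easy to load for further analysis. The following is a minimal sketch, not part of the OPUS-MT tooling: the function names `parse_rows` and `mean_bleu_by_langpair` are illustrative, and the sample holds only four rows copied from the table. An unweighted mean across test sets is a rough summary at best, since the test sets differ in size and domain.

```python
from collections import defaultdict

# Four rows copied from the table above.
# Columns: langpair testset chr-F BLEU #sent #words
ROWS = """\
fra-spa newstest2010 0.62250 37.8 2489 65480
fra-spa newstest2011 0.62953 39.8 3003 79476
fra-spa newstest2012 0.62397 39.0 3003 79006
fra-spa newstest2013 0.59312 34.9 3000 70528
"""


def parse_rows(text):
    """Parse whitespace-separated benchmark rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        langpair, testset, chrf, bleu, sents, words = line.split()
        rows.append({
            "langpair": langpair,
            "testset": testset,
            "chrf": float(chrf),
            "bleu": float(bleu),
            "sentences": int(sents),
            "words": int(words),
        })
    return rows


def mean_bleu_by_langpair(rows):
    """Unweighted mean BLEU per language pair across its test sets."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["langpair"]].append(row["bleu"])
    return {pair: sum(vals) / len(vals) for pair, vals in scores.items()}


print(mean_bleu_by_langpair(parse_rows(ROWS)))
```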

Citation Information

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113), and the MeMAD project, funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.

Model conversion info

  • transformers version: 4.16.2
  • OPUS-MT git hash: 8b9f0b0
  • port time: Fri Aug 12 23:57:49 EEST 2022
  • port machine: LM0-400-22516.local