tiedeman committed

Commit: 08c7279
1 Parent(s): c116a51

Initial commit
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.spm filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,271 @@
---
library_name: transformers
language:
- aai
- aia
- alp
- amk
- aoz
- apr
- aui
- bmk
- bnp
- buk
- bzh
- dad
- de
- dob
- dww
- emi
- en
- es
- far
- fj
- fr
- frd
- gfk
- gil
- haw
- hla
- hot
- hvn
- kbm
- khz
- kje
- kpg
- kqf
- kqw
- kud
- kwf
- lcm
- leu
- lex
- lid
- mee
- mek
- mgm
- mh
- mi
- mmo
- mmx
- mna
- mox
- mpx
- mva
- mwc
- myw
- mzz
- na
- nak
- niu
- nsn
- nss
- nwi
- pt
- ptp
- pwg
- rai
- rap
- rro
- rug
- sgz
- sm
- snc
- sps
- stn
- swp
- tbc
- tbo
- tet
- tgo
- tgp
- tkl
- tlx
- to
- tpa
- tpz
- tte
- tuc
- tvl
- twu
- ty
- ubr
- uvl
- viv
- wed
- wuv
- xsi
- yml

tags:
- translation
- opus-mt-tc-bible

license: apache-2.0
model-index:
- name: opus-mt-tc-bible-big-pqe-deu_eng_fra_por_spa
  results:
  - task:
      name: Translation multi-multi
      type: translation
      args: multi-multi
    dataset:
      name: tatoeba-test-v2020-07-28-v2023-09-26
      type: tatoeba_mt
      args: multi-multi
    metrics:
    - name: BLEU
      type: bleu
      value: 22.2
    - name: chr-F
      type: chrf
      value: 0.36673
---
# opus-mt-tc-bible-big-pqe-deu_eng_fra_por_spa

## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [How to Get Started With the Model](#how-to-get-started-with-the-model)
- [Training](#training)
- [Evaluation](#evaluation)
- [Citation Information](#citation-information)
- [Acknowledgements](#acknowledgements)

## Model Details

Neural machine translation model for translating from Eastern Malayo-Polynesian languages (pqe) to German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa).

This model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many languages of the world. All models are originally trained with [Marian NMT](https://marian-nmt.github.io/), an efficient NMT implementation written in pure C++, and have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from [OPUS](https://opus.nlpl.eu/) and training pipelines follow the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).

**Model Description:**
- **Developed by:** Language Technology Research Group at the University of Helsinki
- **Model Type:** Translation (transformer-big)
- **Release:** 2024-05-30
- **License:** Apache-2.0
- **Language(s):**
  - Source Language(s): aai aia alp amk aoz apr aui bmk bnp buk bzh dad dob dww emi far fij frd gfk gil haw hla hot hvn kbm khz kje kpg kqf kqw kud kwf lcm leu lex lid mah mee mek mgm mmo mmx mna mox mpx mri mva mwc myw mzz nak nau niu nsn nss nwi ptp pwg rai rap rro rug sgz smo snc sps stn swp tah tbc tbo tet tgo tgp tkl tlx ton tpa tpz tte tuc tvl twu ubr uvl viv wed wuv xsi yml
  - Target Language(s): deu eng fra por spa
  - Valid Target Language Labels: >>deu<< >>eng<< >>fra<< >>por<< >>spa<< >>xxx<<
- **Original Model**: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/pqe-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip)
- **Resources for more information:**
  - [OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/pqe-deu%2Beng%2Bfra%2Bpor%2Bspa/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-05-30)
  - [OPUS-MT-train GitHub Repo](https://github.com/Helsinki-NLP/OPUS-MT-train)
  - [More information about MarianNMT models in the transformers library](https://huggingface.co/docs/transformers/model_doc/marian)
  - [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/)
  - [HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset)](https://hplt-project.org/datasets/v1)
  - [A massively parallel Bible corpus](https://aclanthology.org/L14-1215/)

This is a multilingual translation model with multiple target languages. A sentence-initial language token of the form `>>id<<` (where id is a valid target language ID, e.g. `>>deu<<`) is required.

## Uses

This model can be used for translation and text-to-text generation.

## Risks, Limitations and Biases

**CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.**

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).

## How to Get Started With the Model

A short code example:

```python
from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence.",
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-pqe-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
```

You can also use OPUS-MT models with the transformers pipelines, for example:

```python
from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-pqe-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))
```

## Training

- **Data**: opusTCv20230926max50+bt+jhubc ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
- **Pre-processing**: SentencePiece (spm32k,spm32k); see the sketch after this list
- **Model Type:** transformer-big
- **Original MarianNMT Model**: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/pqe-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip)
- **Training Scripts**: [GitHub Repo](https://github.com/Helsinki-NLP/OPUS-MT-train)

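The `source.spm` and `target.spm` files in this repository are the SentencePiece models produced by this pre-processing step. A minimal sketch for inspecting the source-side model, assuming the `sentencepiece` package is installed and the file has been downloaded locally:

```python
import sentencepiece as spm

# Load the source-side SentencePiece model shipped with this repository.
sp = spm.SentencePieceProcessor(model_file="source.spm")

print(sp.get_piece_size())  # spm32k -> a subword vocabulary on the order of 32k pieces

# Segment a sentence into the subword pieces used during training.
print(sp.encode("Replace this with text in an accepted source language.", out_type=str))
```
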
## Evaluation

* [Model scores at the OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/pqe-deu%2Beng%2Bfra%2Bpor%2Bspa/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-05-30)
* test set translations: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/pqe-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt)
* test set scores: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/pqe-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt)
* benchmark results: [benchmark_results.txt](benchmark_results.txt)
* benchmark output: [benchmark_translations.zip](benchmark_translations.zip)

| langpair | testset | chr-F | BLEU | #sent | #words |
|----------|---------|-------|------|-------|--------|
| multi-multi | tatoeba-test-v2020-07-28-v2023-09-26 | 0.36673 | 22.2 | 860 | 5170 |

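The numbers above are corpus-level BLEU and chr-F scores, as typically computed with the sacrebleu toolkit in OPUS-MT evaluations. A minimal sketch of how such scores are obtained, assuming the `sacrebleu` package is installed; the sentences below are placeholders, not the actual test data:

```python
from sacrebleu.metrics import BLEU, CHRF

# Placeholder hypotheses and references (not the benchmark data above).
hypotheses = ["Das ist ein kleiner Test."]
references = [["Das ist ein kleiner Test."]]  # one reference stream, aligned with the hypotheses

print(BLEU().corpus_score(hypotheses, references))  # BLEU on a 0-100 scale, as in the table
print(CHRF().corpus_score(hypotheses, references))  # chr-F; sacrebleu reports 0-100, the table uses 0-1
```
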
## Citation Information

* Publications: [Democratizing neural machine translation with OPUS-MT](https://doi.org/10.1007/s10579-023-09704-w) and [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/) (Please cite these publications if you use this model.)

```bibtex
@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
  title = "{OPUS}-{MT} {--} Building open translation services for the World",
  author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
  booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
  month = nov,
  year = "2020",
  address = "Lisboa, Portugal",
  publisher = "European Association for Machine Translation",
  url = "https://aclanthology.org/2020.eamt-1.61",
  pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
  title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
  author = {Tiedemann, J{\"o}rg},
  booktitle = "Proceedings of the Fifth Conference on Machine Translation",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2020.wmt-1.139",
  pages = "1174--1182",
}
```

## Acknowledgements

The work is supported by the [HPLT project](https://hplt-project.org/), funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by [CSC -- IT Center for Science](https://www.csc.fi/), Finland, and the [EuroHPC supercomputer LUMI](https://www.lumi-supercomputer.eu/).

## Model conversion info

* transformers version: 4.45.1
* OPUS-MT git hash: 0882077
* port time: Tue Oct 8 13:22:51 EEST 2024
* port machine: LM0-400-22516.local
benchmark_results.txt ADDED
@@ -0,0 +1 @@
multi-multi tatoeba-test-v2020-07-28-v2023-09-26 0.36673 22.2 860 5170
benchmark_translations.zip ADDED
File without changes
config.json ADDED
@@ -0,0 +1,41 @@
{
  "_name_or_path": "pytorch-models/opus-mt-tc-bible-big-pqe-deu_eng_fra_por_spa",
  "activation_dropout": 0.0,
  "activation_function": "relu",
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 61709,
  "decoder_vocab_size": 61710,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 136,
  "forced_eos_token_id": null,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "max_length": null,
  "max_position_embeddings": 1024,
  "model_type": "marian",
  "normalize_embedding": false,
  "num_beams": null,
  "num_hidden_layers": 6,
  "pad_token_id": 61709,
  "scale_embedding": true,
  "share_encoder_decoder_embeddings": true,
  "static_position_embeddings": true,
  "torch_dtype": "float32",
  "transformers_version": "4.45.1",
  "use_cache": true,
  "vocab_size": 61710
}
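These values describe a standard Marian transformer-big: 6 encoder and 6 decoder layers, 16 attention heads, a model dimension of 1024, and a shared 61,710-entry vocabulary. A minimal sketch for inspecting the configuration programmatically, assuming the published Helsinki-NLP model ID used in the pipeline example above:

```python
from transformers import AutoConfig

# Load the configuration shown above directly from the model repository.
cfg = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-tc-bible-big-pqe-deu_eng_fra_por_spa")
print(cfg.model_type)                          # "marian"
print(cfg.d_model)                             # 1024 (transformer-big)
print(cfg.encoder_layers, cfg.decoder_layers)  # 6 6
print(cfg.vocab_size)                          # 61710
```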
generation_config.json ADDED
@@ -0,0 +1,16 @@
{
  "_from_model_config": true,
  "bad_words_ids": [
    [
      61709
    ]
  ],
  "bos_token_id": 0,
  "decoder_start_token_id": 61709,
  "eos_token_id": 136,
  "forced_eos_token_id": 136,
  "max_length": 512,
  "num_beams": 4,
  "pad_token_id": 61709,
  "transformers_version": "4.45.1"
}
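These generation defaults (beam search with 4 beams, maximum length 512, and the pad token excluded via `bad_words_ids`) are picked up automatically by `model.generate()`. A minimal sketch of overriding them per call, with illustrative rather than recommended values:

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-pqe-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer([">>eng<< Replace this with text in an accepted source language."], return_tensors="pt")
# Per-call arguments take precedence over the defaults in generation_config.json.
out = model.generate(**batch, num_beams=8, max_length=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```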
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:504191d8318619ff6950314d3128fe7e60de15c98dd437dbf329b4fd57ae27c4
size 958470120
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6eb31bf921cf1419eeca094afc5f098d6f50befd92889c9a6cb015134310bb87
size 958521349
source.spm ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3b4f9d79c81b87bb40bfe897e2bbcbaa5dc18621a6326d715433467974554177
size 790692
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
target.spm ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e209e198e489f730281f3dd3155295c023b8c2975c8b9f873068ac350cbad0ed
size 819946
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"source_lang": "pqe", "target_lang": "deu+eng+fra+por+spa", "unk_token": "<unk>", "eos_token": "</s>", "pad_token": "<pad>", "model_max_length": 512, "sp_model_kwargs": {}, "separate_vocabs": false, "special_tokens_map_file": null, "name_or_path": "marian-models/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30/pqe-deu+eng+fra+por+spa", "tokenizer_class": "MarianTokenizer"}
vocab.json ADDED
The diff for this file is too large to render.