tiedeman committed on
Commit f57877f
1 Parent(s): f645b5e

Initial commit

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.spm filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,241 @@
---
library_name: transformers
language:
- aa
- am
- ar
- arc
- bcw
- byn
- cop
- daa
- de
- dsh
- en
- gde
- gnd
- ha
- hbo
- he
- hig
- irk
- jpa
- kab
- ker
- kqp
- ktb
- kxc
- lln
- lme
- meq
- mfh
- mfi
- mfk
- mif
- mpg
- mqb
- mt
- muy
- nl
- oar
- om
- pbi
- phn
- rif
- sgw
- shi
- shy
- so
- sur
- syc
- syr
- taq
- ti
- tig
- tmc
- tmh
- tmr
- ttr
- tzm
- wal
- xed
- zgh

tags:
- translation
- opus-mt-tc-bible

license: apache-2.0
model-index:
- name: opus-mt-tc-bible-big-afa-deu_eng_nld
  results:
  - task:
      name: Translation multi-multi
      type: translation
      args: multi-multi
    dataset:
      name: tatoeba-test-v2020-07-28-v2023-09-26
      type: tatoeba_mt
      args: multi-multi
    metrics:
    - name: BLEU
      type: bleu
      value: 39.9
    - name: chr-F
      type: chrf
      value: 0.57350
---
# opus-mt-tc-bible-big-afa-deu_eng_nld

## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [How to Get Started With the Model](#how-to-get-started-with-the-model)
- [Training](#training)
- [Evaluation](#evaluation)
- [Citation Information](#citation-information)
- [Acknowledgements](#acknowledgements)

## Model Details

Neural machine translation model for translating from Afro-Asiatic languages (afa) to German, English and Dutch (deu+eng+nld).

This model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many languages in the world. All models were originally trained with [Marian NMT](https://marian-nmt.github.io/), an efficient NMT implementation written in pure C++, and then converted to PyTorch using the transformers library by Hugging Face. Training data is taken from [OPUS](https://opus.nlpl.eu/) and training pipelines use the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).

**Model Description:**
- **Developed by:** Language Technology Research Group at the University of Helsinki
- **Model Type:** Translation (transformer-big)
- **Release:** 2024-08-17
- **License:** Apache-2.0
- **Language(s):**
  - Source Language(s): aar acm afb amh apc ara arc arq arz bcw byn cop daa dsh gde gnd hau hbo heb hig irk jpa kab ker kqp ktb kxc lln lme meq mfh mfi mfk mif mlt mpg mqb muy oar orm pbi phn rif sgw shi shy som sur syc syr taq tig tir tmc tmh tmr ttr tzm wal xed zgh
  - Target Language(s): deu eng nld
  - Valid Target Language Labels: >>deu<< >>eng<< >>nld<< >>xxx<<
- **Original Model:** [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/afa-deu+eng+nld/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip)
- **Resources for more information:**
  - [OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/afa-deu%2Beng%2Bnld/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-08-17)
  - [OPUS-MT-train GitHub Repo](https://github.com/Helsinki-NLP/OPUS-MT-train)
  - [More information about MarianNMT models in the transformers library](https://huggingface.co/docs/transformers/model_doc/marian)
  - [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/)
  - [HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset)](https://hplt-project.org/datasets/v1)
  - [A massively parallel Bible corpus](https://aclanthology.org/L14-1215/)

This is a multilingual translation model with multiple target languages. A sentence-initial language token in the form `>>id<<` (where `id` is a valid target language ID) is required, e.g. `>>deu<<`.

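For example, prefixing can be done with a one-line helper; a minimal sketch (the helper name is hypothetical, not part of the transformers API):

```python
# Hypothetical helper: prepend the target-language token this model expects.
def with_target_lang(text: str, lang: str) -> str:
    return f">>{lang}<< {text}"

print(with_target_lang("هذا هو المكان الذي تعيش فيه.", "nld"))
# >>nld<< هذا هو المكان الذي تعيش فيه.
```
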
## Uses

This model can be used for translation and text-to-text generation.

## Risks, Limitations and Biases

**CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.**

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).

## How to Get Started With the Model

A short code example:

```python
from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with a target-language token (here: English).
src_text = [
    ">>eng<< هذا هو المكان الذي تعيش فيه.",
    ">>eng<< Amdan yesnulfa-d Ṛebbi akken kan wa ur ineqq wa."
]

# Local path used during conversion; the released model lives on the Hub
# as "Helsinki-NLP/opus-mt-tc-bible-big-afa-deu_eng_nld".
model_name = "pytorch-models/opus-mt-tc-bible-big-afa-deu_eng_nld"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding and translate it.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     This is where you live.
#     The man who had been killed by God didn't kill him.
```

You can also use OPUS-MT models with the transformers pipelines, for example:

```python
from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-afa-deu_eng_nld")
print(pipe(">>eng<< هذا هو المكان الذي تعيش فيه."))

# expected output: This is where you live.
```

## Training

- **Data:** opusTCv20230926max50+bt+jhubc ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
- **Pre-processing:** SentencePiece (spm32k,spm32k); see the sketch below this list
- **Model Type:** transformer-big
- **Original MarianNMT Model:** [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/afa-deu+eng+nld/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip)
- **Training Scripts:** [GitHub Repo](https://github.com/Helsinki-NLP/OPUS-MT-train)

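The spm32k models referenced above ship with this repository as `source.spm` and `target.spm`. A minimal sketch for inspecting how they segment text, assuming the standalone `sentencepiece` package and a local copy of the file (MarianTokenizer normally handles this internally):

```python
import sentencepiece as spm

# Load the source-side SentencePiece model (assumes a local copy of source.spm).
sp = spm.SentencePieceProcessor(model_file="source.spm")

# Show the subword pieces for a source sentence.
print(sp.encode("هذا هو المكان الذي تعيش فيه.", out_type=str))
```
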
## Evaluation

* [Model scores at the OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/afa-deu%2Beng%2Bnld/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-08-17)
* test set translations: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/afa-deu+eng+nld/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt)
* test set scores: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/afa-deu+eng+nld/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt)
* benchmark results: [benchmark_results.txt](benchmark_results.txt)
* benchmark output: [benchmark_translations.zip](benchmark_translations.zip)

| langpair    | testset                              | chr-F   | BLEU | #sent | #words |
|-------------|--------------------------------------|---------|------|-------|--------|
| multi-multi | tatoeba-test-v2020-07-28-v2023-09-26 | 0.57350 | 39.9 | 10000 | 73314  |

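To score your own translations against references, a minimal sketch using the `sacrebleu` package (an assumption; the exact tooling behind the numbers above is not stated here). Note that sacrebleu reports both metrics on a 0-100 scale, while the chr-F column above uses a 0-1 scale:

```python
import sacrebleu

# Toy example: system outputs plus one reference stream
# (one reference per output sentence).
hyps = ["This is where you live."]
refs = [["This is where you live."]]

print(sacrebleu.corpus_bleu(hyps, refs).score)  # BLEU, 0-100 scale
print(sacrebleu.corpus_chrf(hyps, refs).score)  # chr-F, 0-100 scale
```
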
## Citation Information

* Publications: [Democratizing neural machine translation with OPUS-MT](https://doi.org/10.1007/s10579-023-09704-w), [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/), and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/) (please cite if you use this model)

```bibtex
@article{tiedemann2023democratizing,
  title = {Democratizing neural machine translation with {OPUS-MT}},
  author = {Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal = {Language Resources and Evaluation},
  number = {58},
  pages = {713--755},
  year = {2023},
  publisher = {Springer Nature},
  issn = {1574-0218},
  doi = {10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
  title = "{OPUS}-{MT} {--} Building open translation services for the World",
  author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
  booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
  month = nov,
  year = "2020",
  address = "Lisboa, Portugal",
  publisher = "European Association for Machine Translation",
  url = "https://aclanthology.org/2020.eamt-1.61",
  pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
  title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
  author = {Tiedemann, J{\"o}rg},
  booktitle = "Proceedings of the Fifth Conference on Machine Translation",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2020.wmt-1.139",
  pages = "1174--1182",
}
```

## Acknowledgements

The work is supported by the [HPLT project](https://hplt-project.org/), funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by [CSC -- IT Center for Science](https://www.csc.fi/), Finland, and the [EuroHPC supercomputer LUMI](https://www.lumi-supercomputer.eu/).

## Model conversion info

* transformers version: 4.45.1
* OPUS-MT git hash: a0ea3b3
* port time: Mon Oct 7 16:08:13 EEST 2024
* port machine: LM0-400-22516.local
benchmark_results.txt ADDED
@@ -0,0 +1 @@
multi-multi	tatoeba-test-v2020-07-28-v2023-09-26	0.57350	39.9	10000	73314
benchmark_translations.zip ADDED
File without changes
config.json ADDED
@@ -0,0 +1,41 @@
{
  "_name_or_path": "pytorch-models/opus-mt-tc-bible-big-afa-deu_eng_nld",
  "activation_dropout": 0.0,
  "activation_function": "relu",
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 61622,
  "decoder_vocab_size": 61623,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 458,
  "forced_eos_token_id": null,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "max_length": null,
  "max_position_embeddings": 1024,
  "model_type": "marian",
  "normalize_embedding": false,
  "num_beams": null,
  "num_hidden_layers": 6,
  "pad_token_id": 61622,
  "scale_embedding": true,
  "share_encoder_decoder_embeddings": true,
  "static_position_embeddings": true,
  "torch_dtype": "float32",
  "transformers_version": "4.45.1",
  "use_cache": true,
  "vocab_size": 61623
}
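
As a quick cross-check of the architecture parameters above, the same values can be read back through the transformers config API; a minimal sketch, assuming the Hub ID used in the pipeline example:

```python
from transformers import AutoConfig

# Load the configuration shown above directly from the Hub.
cfg = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-tc-bible-big-afa-deu_eng_nld")
print(cfg.model_type, cfg.d_model, cfg.encoder_layers, cfg.vocab_size)
# marian 1024 6 61623
```
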
generation_config.json ADDED
@@ -0,0 +1,16 @@
{
  "_from_model_config": true,
  "bad_words_ids": [
    [
      61622
    ]
  ],
  "bos_token_id": 0,
  "decoder_start_token_id": 61622,
  "eos_token_id": 458,
  "forced_eos_token_id": 458,
  "max_length": 512,
  "num_beams": 4,
  "pad_token_id": 61622,
  "transformers_version": "4.45.1"
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:62c19ae221385dc466a0350e2fab0d32d35ee000065b6ad58cef93c33f51de6a
size 958113420
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8345f6fd4e51687d2c1286af427efdcacf4b5762598aa1ac3830358c5bad7505
size 958164677
source.spm ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:beadb8d688c26da446f7870b87c2a938d399e04bd01b8ecdd2724e39e2f138cc
size 808833
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
target.spm ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4fe3c81956229cf3d80e56b8d34e6b377dbf09b8fbc442dc48b85a5e155843e5
size 812809
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"source_lang": "afa", "target_lang": "deu+eng+nld", "unk_token": "<unk>", "eos_token": "</s>", "pad_token": "<pad>", "model_max_length": 512, "sp_model_kwargs": {}, "separate_vocabs": false, "special_tokens_map_file": null, "name_or_path": "marian-models/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17/afa-deu+eng+nld", "tokenizer_class": "MarianTokenizer"}
vocab.json ADDED
The diff for this file is too large to render.