fdelucaf committed commit 4a72808 · verified · 1 parent: 499a57d

Update README.md

Files changed (1): README.md (+61 −43)

README.md CHANGED
@@ -1,32 +1,19 @@
 ---
- license: apache-2.0
 ---
 ## Projecte Aina’s Basque-Catalan machine translation model
 
- ## Table of Contents
- - [Model Description](#model-description)
- - [Intended Uses and Limitations](#intended-use)
- - [How to Use](#how-to-use)
- - [Training](#training)
- - [Training data](#training-data)
- - [Training procedure](#training-procedure)
- - [Data Preparation](#data-preparation)
- - [Tokenization](#tokenization)
- - [Hyperparameters](#hyperparameters)
- - [Evaluation](#evaluation)
- - [Variable and Metrics](#variable-and-metrics)
- - [Evaluation Results](#evaluation-results)
- - [Additional Information](#additional-information)
- - [Author](#author)
- - [Contact Information](#contact-information)
- - [Copyright](#copyright)
- - [Licensing Information](#licensing-information)
- - [Funding](#funding)
- - [Disclaimer](#disclaimer)
-
 ## Model description
 
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Basque-Catalan datasets totalling 10.045.068 sentence pairs. 1.045.677 sentence pairs were parallel data collected from the web while the remaining 8.999.391 sentence pairs were parallel synthetic data created using the ES-EU translator of [HiTZ](http://hitz.eus/). The model was evaluated on the Flores, TaCon and NTREX evaluation datasets.
 
 ## Intended uses and limitations
 
@@ -46,7 +33,7 @@ Translate a sentence using python
 import ctranslate2
 import pyonmttok
 from huggingface_hub import snapshot_download
- model_dir = snapshot_download(repo_id="projecte-aina/mt-aina-eu-ca", revision="main")
 tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
 tokenized=tokenizer.tokenize("Ongi etorri Aina proiektura.")
 translator = ctranslate2.Translator(model_dir)
@@ -54,6 +41,10 @@ translated = translator.translate_batch([tokenized[0]])
 print(tokenizer.detokenize(translated[0][0]['tokens']))
 ```
 
 ## Training
 
 ### Training data
@@ -73,18 +64,23 @@ The Euskera-Catalan data collected from the web was a combination of the followi
 | Ubuntu| 2.752 |
 | **Total** | **1.531.980** |
 
- The 8.999.391 sentence pairs of synthetic parallel data were created from a random sample of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina-ca-es)
 
 ### Training procedure
 
 ### Data preparation
 
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 10.045.068 and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
 
 #### Tokenization
 
- All data is tokenized using sentencepiece, with a 50 thousand token sentencepiece model learned from the combination of all filtered training data. This model is included.
 
 #### Hyperparameters
 
@@ -117,36 +113,58 @@ This data was then concatenated with the synthetic parallel data and training co
 Weights were saved every 1000 updates and reported results are the average of the last 4 checkpoints.
 
 ## Evaluation
 ### Variable and metrics
- We use the BLEU score for evaluation on test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and [NTREX](https://github.com/MicrosoftTranslator/NTREX)
 ### Evaluation results
- Below are the evaluation results on the machine translation from Euskera to Catalan compared to [Google Translate](https://translate.google.com/), [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [ NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):
- | Test set |Google Translate | NLLB 1.3B | NLLB 3.3 |mt-aina-eu-ca|
 |----------------------|--|------------|------------------|---------------|
 | Flores 200 devtest |**29,8**| 17,7 | 26,5 | 26,1 |
 | TaCON | 25,6|15,2 | 24,2 | **27,3** |
 | NTREX |**27,2**|15,8 | 25,3 | 24,3 |
 | Average |**28,4**| 16,2 | 25,3 | 25,9 |
 ## Additional information
 ### Author
- Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center
- ### Contact information
- For further information, send an email to <[email protected]>
 ### Copyright
- Copyright Language Technologies Unit at Barcelona Supercomputing Center (2023)
- ### Licensing information
- This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 ### Funding
- This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 y 2022/TL22/00215334
- ## Limitations and Bias
- At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
 ### Disclaimer
 <details>
 <summary>Click to expand</summary>
- The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
- When third parties, deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
- In no event shall the owner and creator of the models (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.
- </details>
 
 ---
+ license: apache-2.0
+ language:
+ - eu
+ - ca
+ metrics:
+ - bleu
+ library_name: fairseq
 ---
 ## Projecte Aina’s Basque-Catalan machine translation model
 
 ## Model description
 
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Basque-Catalan datasets
+ totalling 10.045.068 sentence pairs. 1.045.677 sentence pairs were parallel data collected from the web while the remaining 8.999.391 sentence pairs
+ were parallel synthetic data created using the ES-EU translator of [HiTZ](http://hitz.eus/). The model was evaluated on the Flores, TaCon and NTREX evaluation datasets.
 
 ## Intended uses and limitations
 
 import ctranslate2
 import pyonmttok
 from huggingface_hub import snapshot_download
+ model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-eu-ca", revision="main")
 tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
 tokenized=tokenizer.tokenize("Ongi etorri Aina proiektura.")
 translator = ctranslate2.Translator(model_dir)
 translated = translator.translate_batch([tokenized[0]])
 print(tokenizer.detokenize(translated[0][0]['tokens']))
 ```
 
+ ## Limitations and bias
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
+ However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
+
 ## Training
 
 ### Training data
 
 | Ubuntu| 2.752 |
 | **Total** | **1.531.980** |
 
+ The 8.999.391 sentence pairs of synthetic parallel data were created from a random sample
+ of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina-ca-es).
 
 ### Training procedure
 
 ### Data preparation
 
+ All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+ This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+ The filtered datasets are then concatenated to form a final corpus of 10.045.068 sentence pairs, and before training the punctuation is normalized using a
+ modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
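The cosine-similarity filtering described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline; in practice the embeddings would come from LaBSE (e.g. via the sentence-transformers package), and the helper name is hypothetical:

```python
import numpy as np

def filter_by_cosine(src_emb, tgt_emb, pairs, threshold=0.75):
    """Keep sentence pairs whose embedding cosine similarity is >= threshold.

    src_emb / tgt_emb: one embedding row per sentence pair; here assumed to be
    precomputed (with LaBSE in the model card's setup).
    """
    # Normalize rows so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = np.sum(src * tgt, axis=1)  # row-wise cosine similarity
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```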
 
 #### Tokenization
 
+ All data is tokenized using sentencepiece, with a 50 thousand token sentencepiece model learned from the combination of all filtered training data.
+ This model is included.
 
 #### Hyperparameters
 
 Weights were saved every 1000 updates and reported results are the average of the last 4 checkpoints.
 
 ## Evaluation
+
 ### Variable and metrics
+
+ We use the BLEU score for evaluation on test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200),
+ [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and
+ [NTREX](https://github.com/MicrosoftTranslator/NTREX).
+
 ### Evaluation results
 
+ Below are the evaluation results on the machine translation from Basque to Catalan compared to [Google Translate](https://translate.google.com/),
+ [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):
+
+ | Test set | Google Translate | NLLB 1.3B | NLLB 3.3B | aina-translator-eu-ca |
 |----------------------|--|------------|------------------|---------------|
 | Flores 200 devtest |**29,8**| 17,7 | 26,5 | 26,1 |
 | TaCON | 25,6|15,2 | 24,2 | **27,3** |
 | NTREX |**27,2**|15,8 | 25,3 | 24,3 |
 | Average |**28,4**| 16,2 | 25,3 | 25,9 |
+
 ## Additional information
 
 ### Author
+ The Language Technologies Unit from Barcelona Supercomputing Center.
+
+ ### Contact
+ For further information, please send an email to <[email protected]>.
+
 ### Copyright
+ Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
+
+ ### License
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
 ### Funding
+ This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU
+ within the framework of the [project ILENIA](https://proyectoilenia.es/)
+ with reference 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 and 2022/TL22/00215334.
+
 ### Disclaimer
+
 <details>
 <summary>Click to expand</summary>
 
+ The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.
+
+ Be aware that the model may have biases and/or any other undesirable distortions.
+
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
+ or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
+ in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
+
+ In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
+ be liable for any results arising from the use made by third parties.
+
+ </details>