Fairseq
Basque
Catalan
fdelucaf committed on
Commit
f31b25d
·
verified ·
1 Parent(s): ed7c5f0

Update README.md

Files changed (1)
  1. README.md +5 -15
README.md CHANGED
@@ -49,19 +49,9 @@ However, we are well aware that our models may be biased. We intend to conduct r
49
 
50
  ### Training data
51
 
52
- The Euskera-Catalan data collected from the web was a combination of the following datasets:
53
-
54
- | Dataset | Sentences before cleaning |
55
- |-------------------|----------------|
56
- | CCMatrix v1 | 1.083.677 |
57
- | XLENT | 219.566 |
58
- | WikiMatrix | 77.233 |
59
- | GNOME | 14.828|
60
- | KDE4 | 93.787 |
61
- | OpenSubtitles | 29.114 |
62
- | Ubuntu| 2.752 |
63
-
64
- The 8.999.391 sentence pairs of synthetic parallel data were created from a random sample
65
  of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina-ca-es).
66
 
67
  ### Training procedure
@@ -70,7 +60,7 @@ of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina
70
 
71
  All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
72
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
73
- The filtered datasets are then concatenated to form a final corpus of 10.045.068 and before training the punctuation is normalized using a
74
  modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
75
 
76
 
@@ -119,7 +109,7 @@ We use the BLEU score for evaluation on test sets: [Flores-200](https://github.c
119
 
120
  ### Evaluation results
121
 
122
- Below are the evaluation results on the machine translation from Euskera to Catalan compared to [Google Translate](https://translate.google.com/),
123
  [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [ NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):
124
 
125
  | Test set |Google Translate | NLLB 1.3B | NLLB 3.3 | aina-translator-eu-ca |
 
49
 
50
  ### Training data
51
 
52
+ The Basque-Catalan data is a combination of publicly available bilingual datasets collected from the web.
53
+ These datasets were concatenated before filtering to avoid intra-dataset duplicates.
54
+ Additional 8.999.391 sentence pairs of synthetic parallel data were created from a random sample
55
  of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina-ca-es).
56
 
57
  ### Training procedure
 
60
 
61
  All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
62
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
63
+ The filtered datasets are then concatenated to form a final corpus of **10.045.068** and before training the punctuation is normalized using a
64
  modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
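
The deduplication and similarity filter described in this hunk can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `cosine_similarity` and `dedup_and_filter` and the `embed` callable are hypothetical, and exact-match deduplication on the (source, target) pair is an assumption, since the README does not say how duplicates are detected. Only the 0.75 threshold and the use of sentence embeddings come from the text.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_and_filter(
    pairs: Iterable[Pair],
    embed: Callable[[str], Vector],  # hypothetical: maps a sentence to its embedding
    threshold: float = 0.75,         # threshold stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then drop pairs whose similarity is below threshold."""
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # assumption: duplicates are exact (source, target) matches
        seen.add((src, tgt))
        if cosine_similarity(embed(src), embed(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept
```

With the sentence-transformers package, `embed` could be something like `SentenceTransformer("sentence-transformers/LaBSE").encode`, though the README does not state how the embeddings were computed in practice.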
65
 
66
 
 
109
 
110
  ### Evaluation results
111
 
112
+ Below are the evaluation results on the machine translation from Basque to Catalan compared to [Google Translate](https://translate.google.com/),
113
  [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [ NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):
114
 
115
  | Test set |Google Translate | NLLB 1.3B | NLLB 3.3 | aina-translator-eu-ca |