Update README.md
README.md CHANGED
@@ -49,19 +49,9 @@ However, we are well aware that our models may be biased. We intend to conduct r

### Training data

-The Basque-Catalan data collected from the web is a combination of the following datasets:
-
-| Dataset           | Sentence pairs |
-|-------------------|----------------|
-| CCMatrix v1       | 1.083.677      |
-| XLENT             | 219.566        |
-| WikiMatrix        | 77.233         |
-| GNOME             | 14.828         |
-| KDE4              | 93.787         |
-| OpenSubtitles     | 29.114         |
-| Ubuntu            | 2.752          |
-
-The 8.999.391 sentence pairs of synthetic parallel data were created from a random sample
+The Basque-Catalan data is a combination of publicly available bilingual datasets collected from the web.
+These datasets were concatenated before filtering to avoid intra-dataset duplicates.
+An additional 8.999.391 sentence pairs of synthetic parallel data were created from a random sample
of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina-ca-es).

### Training procedure
@@ -70,7 +60,7 @@ of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina

All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
-The filtered datasets are then concatenated to form a final corpus of 10.045.068 sentence pairs, and before training the punctuation is normalized using a
+The filtered datasets are then concatenated to form a final corpus of **10.045.068** sentence pairs, and before training the punctuation is normalized using a
modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
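For illustration, the deduplication and LaBSE-based filtering described in the hunk above could look roughly like the following Python sketch using the sentence-transformers library. This is not the project's actual pipeline: the function name and in-memory handling are assumptions, and only the 0.75 cosine-similarity threshold comes from the README.

```python
# Illustrative sketch only: deduplicate aligned sentence pairs, then keep pairs
# whose LaBSE cosine similarity is at least 0.75 (the threshold named in the README).
from sentence_transformers import SentenceTransformer
import numpy as np

SIM_THRESHOLD = 0.75  # from the README; everything else here is assumed

def dedupe_and_filter(src_sentences, tgt_sentences):
    # 1. Drop exact duplicate pairs while preserving order.
    seen, pairs = set(), []
    for pair in zip(src_sentences, tgt_sentences):
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)

    # 2. Embed both sides with LaBSE; normalized embeddings make cosine
    #    similarity a simple row-wise dot product.
    model = SentenceTransformer("sentence-transformers/LaBSE")
    src_emb = model.encode([s for s, _ in pairs], normalize_embeddings=True)
    tgt_emb = model.encode([t for _, t in pairs], normalize_embeddings=True)
    sims = np.sum(src_emb * tgt_emb, axis=1)

    # 3. Keep only pairs at or above the similarity threshold.
    return [pair for pair, sim in zip(pairs, sims) if sim >= SIM_THRESHOLD]
```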
@@ -119,7 +109,7 @@ We use the BLEU score for evaluation on test sets: [Flores-200](https://github.c

### Evaluation results

-Below are the evaluation results on the machine translation from
+Below are the evaluation results on machine translation from Basque to Catalan, compared to [Google Translate](https://translate.google.com/),
[NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):

| Test set | Google Translate | NLLB 1.3B | NLLB 3.3B | aina-translator-eu-ca |
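The BLEU scores referenced above could be reproduced in spirit with a short sacreBLEU call. The sketch below is only a hypothetical example: the helper function and file names are made up, and the project's real evaluation setup may differ.

```python
# Illustrative sketch only: corpus-level BLEU for one system output against one
# reference file, computed with sacreBLEU. File names are placeholders.
import sacrebleu

def bleu_score(hypothesis_path, reference_path):
    with open(hypothesis_path, encoding="utf-8") as f:
        hypotheses = [line.rstrip("\n") for line in f]
    with open(reference_path, encoding="utf-8") as f:
        references = [line.rstrip("\n") for line in f]

    # sacreBLEU takes the hypotheses plus a list of reference streams.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

# Hypothetical usage, e.g. a Flores-200 devtest output in Catalan:
# print(bleu_score("flores200-devtest.eu-ca.hyp", "flores200-devtest.ca.ref"))
```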