projecte-aina/aina-translator-eu-ca

Fairseq

Basque

Catalan


fdelucaf committed on Jun 21, 2024

Commit f31b25d (verified) · 1 parent: ed7c5f0

Update README.md


README.md +5 -15


@@ -49,19 +49,9 @@ However, we are well aware that our models may be biased. We intend to conduct r
 ### Training data
-The Euskera-Catalan data collected from the web was a combination of the following datasets:
-| Dataset       	| Sentences before cleaning	|
-|-------------------|----------------|
-| CCMatrix  v1  	| 1.083.677  	|
-| XLENT	| 219.566	|
-| WikiMatrix  	| 77.233	|
-| GNOME	| 14.828|
-| KDE4    	| 93.787 	|
-| OpenSubtitles	| 29.114 |
-| Ubuntu| 2.752 	|
-The 8.999.391 sentence pairs of synthetic parallel data were created from a random sample
 of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina-ca-es).
 ### Training procedure
@@ -70,7 +60,7 @@ of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina
  All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
- The filtered datasets are then concatenated to form a final corpus of 10.045.068 and before training the punctuation is normalized using a
  modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
@@ -119,7 +109,7 @@ We use the BLEU score for evaluation on test sets: [Flores-200](https://github.c
 ### Evaluation results
-Below are the evaluation results on the machine translation from Euskera to Catalan compared to [Google Translate](https://translate.google.com/),
 [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [ NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):
 | Test set         	|Google Translate | NLLB 1.3B | NLLB 3.3 | aina-translator-eu-ca |

 ### Training data
+The Basque-Catalan data is a combination of publicly available bilingual datasets collected from the web.
+These datasets were concatenated before filtering to avoid intra-dataset duplicates.
+Additional 8.999.391 sentence pairs of synthetic parallel data were created from a random sample
 of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina-ca-es).
 ### Training procedure
  All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+ The filtered datasets are then concatenated to form a final corpus of **10.045.068** and before training the punctuation is normalized using a
  modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
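
The deduplication and similarity filter described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function names `cosine` and `filter_pairs` are hypothetical, the toy 2-dimensional vectors stand in for real LaBSE sentence embeddings, and exact-match deduplication performed before the similarity check is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, src_vecs, tgt_vecs, threshold=0.75):
    """Drop exact duplicate pairs, then drop pairs whose similarity is below the threshold.

    pairs:    list of (source_sentence, target_sentence) tuples
    src_vecs: one embedding per source sentence (assumed to come from LaBSE)
    tgt_vecs: one embedding per target sentence (assumed to come from LaBSE)
    """
    seen = set()
    kept = []
    for pair, u, v in zip(pairs, src_vecs, tgt_vecs):
        if pair in seen:  # exact-match deduplication (an assumption)
            continue
        seen.add(pair)
        if cosine(u, v) >= threshold:  # keep only sufficiently similar pairs
            kept.append(pair)
    return kept


# Toy data: the second pair is a duplicate, the third is a mistranslation.
pairs = [("Kaixo", "Hola"), ("Kaixo", "Hola"), ("Eskerrik asko", "Bon dia")]
src_vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
tgt_vecs = [[0.9, 0.1], [0.9, 0.1], [0.0, 1.0]]
kept = filter_pairs(pairs, src_vecs, tgt_vecs)
```

In a real setting the embeddings would presumably be computed with the LaBSE model linked above rather than written by hand; only the thresholding logic is shown here.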
 ### Evaluation results
+Below are the evaluation results on the machine translation from Basque to Catalan compared to [Google Translate](https://translate.google.com/),
 [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [ NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):
 | Test set         	|Google Translate | NLLB 1.3B | NLLB 3.3 | aina-translator-eu-ca |