fdelucaf committed on
Commit 601ecba · verified · 1 Parent(s): 928c3fc

Create README.md

Files changed (1)
  1. README.md +86 -0
README.md ADDED
@@ -0,0 +1,86 @@
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ datasets:
4
+ - projecte-aina/ES-AN_Parallel_Corpus
5
+ language:
6
+ - es
7
+ - an
8
+ metrics:
9
+ - bleu
10
+ - chrf
11
+ library_name: transformers
12
+ ---
13
+ ## Projecte Aina’s Spanish-Aragonese machine translation model
14
+
15
+ ## Model description
16
+
17
+ This model was created as part of the participation of the Language Technologies Unit at BSC in the WMT24 Shared Task:
18
+ [Translation into Low-Resource Languages of Spain](https://www2.statmt.org/wmt24/romance-task.html).
19
+ It results from a full fine-tuning of the [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model with a Spanish-Aragonese corpus.
20
+ Specifically, we used the [transformers library](https://huggingface.co/docs/transformers/) from Hugging Face and a filtered version
21
+ of the [Spanish-Aragonese dataset](https://huggingface.co/datasets/projecte-aina/ES-AN_Parallel_Corpus) to fine-tune the model. Since the original NLLB-200-600M does not support Aragonese, we added a new language token ("arg_Latn") to enable translation into Aragonese. Language tags like this one tell the model which source and target languages it is translating between.
22
+ The model was evaluated using the Flores+ evaluation datasets.
23
+ Please refer to the [paper](__poner_link___) for more information.
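
The language-tag step described above could look roughly like the following sketch. It is not the authors' training code: the helper name `add_language_token` is hypothetical, and how the new embedding row is initialized before fine-tuning is left to the library default, which is an assumption.

```python
BASE_MODEL = "facebook/nllb-200-distilled-600M"  # base checkpoint linked in the model card
NEW_LANG_TOKEN = "arg_Latn"                      # language tag named in the model card


def add_language_token(base_model: str = BASE_MODEL, token: str = NEW_LANG_TOKEN):
    """Load the base model and register `token` as an extra special token.

    Hypothetical helper for illustration; the actual procedure is not
    described in the model card.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

    # add_tokens returns how many of the given tokens were new to the vocabulary.
    num_added = tokenizer.add_tokens([token], special_tokens=True)
    if num_added:
        # Grow the embedding matrix so the new token id has a row to train.
        model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model
```

Calling `add_language_token()` downloads the base checkpoint, so it is best run once before fine-tuning rather than at import time.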
24
+
25
+ ## Intended uses and limitations
26
+
27
+ You can use this model for machine translation from Spanish to Aragonese.
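
A minimal usage sketch with the transformers library follows. Several things here are assumptions rather than statements from the model card: the repository id is not given in this text, so `model_id` must be supplied by the caller; `spa_Latn` is the standard NLLB tag for Spanish and is assumed to be the expected source tag; and the helper name `translate` and the `max_length` default are illustrative.

```python
SRC_LANG = "spa_Latn"  # standard NLLB tag for Spanish (assumed source tag)
TGT_LANG = "arg_Latn"  # tag added for Aragonese, per the model description


def translate(text: str, model_id: str, max_length: int = 256) -> str:
    """Translate one Spanish sentence into Aragonese (illustrative helper)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang=SRC_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # Force the decoder to start with the target-language tag.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(TGT_LANG),
        max_length=max_length,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

For example, `translate("Bienvenidos al proyecto.", model_id=...)` would be called with the id of this repository in place of the ellipsis. Loading the model inside the function keeps the sketch short; for batch use, load the tokenizer and model once and reuse them.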
28
+
29
+ ## Limitations and bias
30
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
31
+ However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future and will update this model card if that work is completed.
32
+
33
+ ## Evaluation
34
+
35
+ ### Variables and metrics
36
+
37
+ We use BLEU and ChrF scores to evaluate the model on the [Flores+](https://github.com/openlanguagedata/flores) evaluation datasets.
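
To give an intuition for what ChrF measures, here is a simplified, sentence-level character n-gram F-score in plain Python. It is an illustrative sketch only: the model card does not say which tool produced the reported scores, the function names are hypothetical, and this version will not reproduce the numbers of an established implementation such as sacreBLEU.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level ChrF on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because the score works on characters rather than words, a translation that differs from the reference only in a few letters still gets partial credit, which is useful for closely related languages.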
38
+
39
+ ### Evaluation results
40
+
41
+ Below are the evaluation results for machine translation from Spanish to Aragonese, compared to [Apertium](https://www.apertium.org/), [Softcatala](https://www.softcatala.org/traductor/) (cascading through Catalan) and [Traduze](https://traduze.aragon.es):
42
+
43
+ | Test set (BLEU) | Apertium | Softcatala | Traduze | Our model |
44
+ |:---------------------|:---------|:-------|:----------|:-----------|
45
+ | Flores dev | 65.34 | 50.21 | 37.43 | **71.14** |
46
+ | Flores devtest | 61.11 | 47.08 | 35.47 | **62.32** |
47
+
48
+ | Test set (ChrF) | Apertium | Softcatala | Traduze | Our model |
49
+ |:---------------------|:---------|:-------|:----------|:-----------|
50
+ | Flores dev | 82.00 | 73.97 | 69.51 | **84.63** |
51
+ | Flores devtest | 79.31 | 71.99 | 67.66 | **79.88** |
52
+
53
+ ## Additional information
54
+
55
+ ### Paper
56
+ For further information, please refer to the [paper](__poner_link___) published for the Shared Task: Translation into Low-Resource Languages of Spain (WMT24).
57
+
58
+ ### Author
59
+ The Language Technologies Unit from Barcelona Supercomputing Center.
60
+
61
+ ### Contact
62
+ For further information, please send an email to <[email protected]>.
63
+
64
+ ### Copyright
65
+ Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.
66
+
67
+ ### License
68
+ [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/)
69
+
70
+ ### Disclaimer
71
+
72
+ <details>
73
+ <summary>Click to expand</summary>
74
+
75
+ The model published in this repository is intended for a generalist purpose and is available to third parties under a CC BY-NC 4.0 license.
76
+
77
+ Be aware that the model may have biases and/or any other undesirable distortions.
78
+
79
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
80
+ or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
81
+ in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
82
+
83
+ In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
84
+ be liable for any results arising from the use made by third parties.
85
+
86
+ </details>