milmor committed on
Commit
7ca117f
1 Parent(s): 243898c

Update app.py

Files changed (1)
  1. app.py +20 -14
app.py CHANGED
@@ -6,7 +6,7 @@ os.environ["TOKENIZERS_PARALLELISM"] = "false"
6
 
7
  article='''
8
  # Spanish Nahuatl Automatic Translation
9
- Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for the neural machine translation task is challenging due to the lack of structured data. The most popular datasets, such as the Axolot and bible-corpus, only consist of ~16,000 and ~7,000 samples, respectively. Moreover, there are multiple variants of Nahuatl, which makes this task even more difficult. For example, it is possible to find a single word from the Axolot dataset written in more than three different ways. Therefore, we leverage the T5 text-to-text prefix training strategy in this work to compensate for the lack of data. We first teach the multilingual model Spanish using English, then transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. Finally, we report Chrf and BLEU results.
10
 
11
  ## Motivation
12
 
@@ -14,7 +14,7 @@ One of the United Nations Sustainable Development Goals is ["Reduced Inequalitie
14
 
15
 
16
  ## Model description
17
- This model is a T5 Transformer ([t5-small](https://huggingface.co/t5-small)) fine-tuned on spanish and nahuatl sentences collected from the web. The dataset is normalized using 'sep' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).
18
 
19
 
20
  ## Usage
@@ -35,7 +35,7 @@ outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
35
 
36
  ## Approach
37
  ### Dataset
38
- Since the Axolotl corpus contains misaligments, we just select the best samples (12,207 samples). We also use the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821 samples).
39
 
40
  | Axolotl best aligned books |
41
  |:-----------------------------------------------------:|
@@ -54,36 +54,31 @@ Since the Axolotl corpus contains misaligments, we just select the best samples
54
  | Una tortillita nomás - Se taxkaltsin saj |
55
  | Vida económica de Tenochtitlan |
56
 
57
- Also, to increase the amount of data, we collected 3,000 extra samples from the web.
58
 
59
  ### Model and training
60
- We employ two training stages using a multilingual T5-small. We use this model because it can handle different vocabularies and prefixes. T5-small is pre-trained on different tasks and languages (French, Romanian, English, German).
61
 
62
  ### Training-stage 1 (learning Spanish)
63
- In training stage 1 we first introduce Spanish to the model. The goal is to learn a new language rich in data (Spanish) and not lose the previous knowledge acquired. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs. We train the model till convergence adding the prefix "Translate Spanish to English: ".
64
 
65
  ### Training-stage 2 (learning Nahuatl)
66
- We use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the amount of Nahuatl pairs is limited, we also add to our dataset 20,000 samples from the English-Spanish Anki dataset. This two-task-training avoids overfitting end makes the model more robust.
67
 
68
  ### Training setup
69
  We train the models on the same datasets for 660k steps using batch size = 16 and a learning rate of 2e-5.
70
 
71
 
72
  ## Evaluation results
73
- For a fair comparison, the models are evaluated on the same 505 validation Nahuatl sentences. We report the results using chrf and sacrebleu hugging face metrics:
74
 
75
  | English-Spanish pretraining | Validation loss | BLEU | Chrf |
76
  |:----------------------------:|:---------------:|:-----|-------:|
77
  | False | 1.34 | 6.17 | 26.96 |
78
  | True | 1.31 | 6.18 | 28.21 |
79
 
80
- The English-Spanish pretraining improves BLEU and Chrf, and leads to faster convergence. You can reproduce the evaluation on the [eval.ipynb](https://github.com/milmor/spanish-nahuatl-translation/blob/main/eval.ipynb) notebook.
81
 
82
- # Team members
83
- - Emilio Alejandro Morales [(milmor)](https://huggingface.co/milmor)
84
- - Rodrigo Martínez Arzate [(rockdrigoma)](https://huggingface.co/rockdrigoma)
85
- - Luis Armando Mercado [(luisarmando)](https://huggingface.co/luisarmando)
86
- - Jacobo del Valle [(jjdv)](https://huggingface.co/jjdv)
87
 
88
  ## References
89
  - Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits
@@ -91,6 +86,17 @@ of transfer learning with a unified Text-to-Text transformer.
91
 
92
  - Ximena Gutierrez-Vasques, Gerardo Sierra, and Hernandez Isaac. 2016. Axolotl: a web accessible parallel corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).
93
94
  '''
95
 
96
  model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
 
6
 
7
  article='''
8
  # Spanish Nahuatl Automatic Translation
9
+ Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for neural machine translation is challenging due to the lack of structured data. The most popular datasets, such as Axolotl and bible-corpus, only contain ~16,000 and ~7,000 samples, respectively. Moreover, there are multiple variants of Nahuatl, which makes the task even more difficult. For example, a single word from the Axolotl dataset can be found written in more than three different ways. Therefore, we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first train the multilingual model to learn Spanish and then adapt it to Nahuatl. The resulting model successfully translates short sentences. Finally, we report Chrf and BLEU results.
10
 
11
  ## Motivation
12
 
 
14
 
15
 
16
  ## Model description
17
+ This model is a T5 Transformer ([t5-small](https://huggingface.co/t5-small)) fine-tuned on Spanish and Nahuatl sentences collected from the web. The dataset is normalized using 'sep' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).
18
 
19
 
20
  ## Usage
 
35
 
36
  ## Approach
37
  ### Dataset
38
+ Since the Axolotl corpus contains misalignments, we select the best samples (12,207). We also use the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821).
39
 
40
  | Axolotl best aligned books |
41
  |:-----------------------------------------------------:|
 
54
  | Una tortillita nomás - Se taxkaltsin saj |
55
  | Vida económica de Tenochtitlan |
56
 
57
+ Also, we collected 3,000 extra samples from the web to increase the amount of data.
58
 
59
  ### Model and training
60
+ We employ two training stages using a multilingual T5-small. The advantage of this model is that it can handle different vocabularies and prefixes. T5-small is pre-trained on different tasks and languages (French, Romanian, English, German).
61
 
62
  ### Training-stage 1 (learning Spanish)
63
+ In training stage 1, we first introduce Spanish to the model. The goal is to learn a new language rich in data (Spanish) without losing the previously acquired knowledge. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs. We then train the model until convergence, adding the prefix "Translate Spanish to English: ".
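The prefix strategy can be sketched in plain Python; the helper name below is illustrative, not from the repository:

```python
def to_text2text(source, target, task_prefix):
    """Turn a translation pair into a T5 text-to-text example.

    T5 distinguishes tasks by prepending a natural-language prefix
    to the input, so one model can serve several translation tasks.
    """
    return {"input": task_prefix + source, "target": target}

# A stage-1 example built from an English-Spanish Anki pair.
example = to_text2text("Hola.", "Hello.", "Translate Spanish to English: ")
```

In stage 2, the same formatting would be reused with a Nahuatl-specific prefix on the Spanish-Nahuatl pairs.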
64
 
65
  ### Training-stage 2 (learning Nahuatl)
66
+ We use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the number of Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset to our dataset. This two-task training avoids overfitting and makes the model more robust.
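The two-task mix described above can be sketched as follows (function and variable names are illustrative):

```python
import random

def build_stage2_dataset(nahuatl_pairs, anki_pairs, n_anki=20_000, seed=0):
    """Combine the scarce Spanish-Nahuatl pairs with a random sample of
    English-Spanish Anki pairs so stage 2 trains on both tasks at once."""
    rng = random.Random(seed)
    sampled = rng.sample(anki_pairs, min(n_anki, len(anki_pairs)))
    mixed = list(nahuatl_pairs) + sampled
    rng.shuffle(mixed)  # interleave the two tasks within each epoch
    return mixed
```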
67
 
68
  ### Training setup
69
  We train the models on the same datasets for 660k steps using batch size = 16 and a learning rate of 2e-5.
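Spelled out, the setup above corresponds to the following hyperparameters (a plain-Python summary; the key names are illustrative):

```python
train_config = {
    "max_steps": 660_000,
    "batch_size": 16,
    "learning_rate": 2e-5,
}

# Total training examples seen over the run: 660,000 * 16 = 10,560,000.
examples_seen = train_config["max_steps"] * train_config["batch_size"]
```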
70
 
71
 
72
  ## Evaluation results
73
+ We evaluate the model on the same 505 validation Nahuatl sentences for a fair comparison. Finally, we report the results using the Hugging Face chrf and sacrebleu metrics:
74
 
75
  | English-Spanish pretraining | Validation loss | BLEU | Chrf |
76
  |:----------------------------:|:---------------:|:-----|-------:|
77
  | False | 1.34 | 6.17 | 26.96 |
78
  | True | 1.31 | 6.18 | 28.21 |
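chrf is a character n-gram F-score. The numbers above come from the Hugging Face chrf metric; a simplified pure-Python sketch of the idea (single sentence pair, no whitespace handling) looks like this:

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average character n-gram precision and recall
    for n = 1..max_n, then combine with F-beta (beta = 2 favors recall)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return 0.0 if p + r == 0 else 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

A perfect hypothesis scores 100; disjoint strings score 0.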
79
 
 
80
 
81
+ The English-Spanish pretraining improves BLEU and Chrf and leads to faster convergence. The evaluation can be reproduced with the [eval.ipynb](https://github.com/milmor/spanish-nahuatl-translation/blob/main/eval.ipynb) notebook.
82
 
83
  ## References
84
  - Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits
 
86
 
87
  - Ximena Gutierrez-Vasques, Gerardo Sierra, and Hernandez Isaac. 2016. Axolotl: a web accessible parallel corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).
88
 
89
+ - https://github.com/christos-c/bible-corpus
90
+
91
+ - https://github.com/ElotlMX/py-elotl
92
+
93
+
94
+ ## Team members
95
+ - Emilio Alejandro Morales [(milmor)](https://huggingface.co/milmor)
96
+ - Rodrigo Martínez Arzate [(rockdrigoma)](https://huggingface.co/rockdrigoma)
97
+ - Luis Armando Mercado [(luisarmando)](https://huggingface.co/luisarmando)
98
+ - Jacobo del Valle [(jjdv)](https://huggingface.co/jjdv)
99
+
100
  '''
101
 
102
  model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')