rockdrigoma committed on
Commit
56423fb
1 Parent(s): da9fda5

Update app.py

Files changed (1)
  1. app.py +18 -8
app.py CHANGED
@@ -2,9 +2,13 @@ import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

article='''
- # t5-small-spanish-nahuatl
Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for the task of neural machine translation is hard due to the lack of structured data. The most popular datasets, such as the Axolotl corpus and the bible-corpus, consist of only ~16,000 and ~7,000 samples respectively. Moreover, there are multiple variants of Nahuatl, which makes this task even more difficult. For example, a single word from the Axolotl dataset can be found written in more than three different ways. Therefore, in this work we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first teach the multilingual model Spanish using English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report ChrF and BLEU results.

## Model description
This model is a T5 Transformer ([t5-small](https://huggingface.co/t5-small)) fine-tuned on Spanish and Nahuatl sentences collected from the web. The dataset is normalized using 'sep' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).
@@ -34,18 +38,18 @@ Since the Axolotl corpus contains misalignments, we just select the best samples
|:-----------------------------------------------------:|
| Anales de Tlatelolco |
| Diario |
- | Documentos nauas de la Ciudad de México del siglo XVI |
- | Historia de México narrada en náhuatl y español |
- | La tinta negra y roja (antología de poesía náhuatl) |
| Memorial Breve (Libro las ocho relaciones) |
- | Método auto-didáctico náhuatl-español |
| Nican Mopohua |
- | Quinta Relación (Libro las ocho relaciones) |
| Recetario Nahua de Milpa Alta D.F |
| Testimonios de la antigua palabra |
| Trece Poetas del Mundo Azteca |
- | Una tortillita nomás - Se taxkaltsin saj |
- | Vida económica de Tenochtitlan |

Also, to increase the amount of data, we collected 3,000 extra samples from the web.
 
@@ -72,6 +76,12 @@ For a fair comparison, the models are evaluated on the same 505 validation Nahuatl

The English-Spanish pretraining improves BLEU and ChrF, and leads to faster convergence.

## References
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified Text-to-Text transformer.
 
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

article='''
+ # Spanish Nahuatl Automatic Translation
Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for the task of neural machine translation is hard due to the lack of structured data. The most popular datasets, such as the Axolotl corpus and the bible-corpus, consist of only ~16,000 and ~7,000 samples respectively. Moreover, there are multiple variants of Nahuatl, which makes this task even more difficult. For example, a single word from the Axolotl dataset can be found written in more than three different ways. Therefore, in this work we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first teach the multilingual model Spanish using English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report ChrF and BLEU results.
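
In practice, the text-to-text strategy amounts to prepending a task prefix to every source sentence. A minimal sketch of querying the fine-tuned model; the repository id and the exact prefix string are assumptions, neither is shown in this diff:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed repository id; replace with the actual checkpoint name.
MODEL_ID = 'hackathon-pln-es/t5-small-spanish-nahuatl'

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Assumed T5-style task prefix prepended to the Spanish input.
inputs = tokenizer('translate Spanish to Nahuatl: muchas gracias',
                   return_tensors='pt')
outputs = model.generate(inputs.input_ids, max_length=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```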

+ ## Motivation
+
+ One of the Sustainable Development Goals is "Reduced Inequalities". We know for sure that language is one
+

## Model description
This model is a T5 Transformer ([t5-small](https://huggingface.co/t5-small)) fine-tuned on Spanish and Nahuatl sentences collected from the web. The dataset is normalized using 'sep' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).
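
A sketch of the 'sep' normalization step, assuming py-elotl exposes the Normalizer API described in its README; the example input is a toy assumption:

```python
import elotl.nahuatl.orthography

# The Normalizer maps variant Nahuatl spellings onto one orthographic
# convention; 'sep' is the convention named in the model description.
normalizer = elotl.nahuatl.orthography.Normalizer("sep")

print(normalizer.normalize("otechmomaquili"))  # toy example word
```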
 
38
  |:-----------------------------------------------------:|
39
  | Anales de Tlatelolco |
40
  | Diario |
41
+ | Documentos nauas de la Ciudad de México del siglo XVI |
42
+ | Historia de México narrada en náhuatl y español |
43
+ | La tinta negra y roja (antología de poesía náhuatl) |
44
  | Memorial Breve (Libro las ocho relaciones) |
45
+ | Método auto-didáctico náhuatl-español |
46
  | Nican Mopohua |
47
+ | Quinta Relación (Libro las ocho relaciones) |
48
  | Recetario Nahua de Milpa Alta D.F |
49
  | Testimonios de la antigua palabra |
50
  | Trece Poetas del Mundo Azteca |
51
+ | Una tortillita nomás - Se taxkaltsin saj |
52
+ | Vida económica de Tenochtitlan |
53
 
54
  Also, to increase the amount of data we collected 3,000 extra samples from the web.
55
 
 
The English-Spanish pretraining improves BLEU and ChrF, and leads to faster convergence.
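
A sketch of how both metrics can be computed, assuming the evaluation uses sacrebleu; the library choice and the toy data are assumptions:

```python
import sacrebleu

# Toy data: hypotheses are model outputs, references the gold Nahuatl.
hypotheses = ['nimitztlazohtla']
references = [['nimitztlazohtla']]  # sacrebleu expects a list of reference streams

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f'BLEU: {bleu.score:.2f}, ChrF: {chrf.score:.2f}')
```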

+ ## Team members
+ - Emilio Alejandro Morales [(milmor)](https://huggingface.co/milmor)
+ - Rodrigo Martínez Arzate [(rockdrigoma)](https://huggingface.co/rockdrigoma)
+ - Luis Armando Mercado [(luisarmando)](https://huggingface.co/luisarmando)
+ - Jacobo del Valle [(jjdv)](https://huggingface.co/jjdv)
+
## References
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified Text-to-Text transformer.
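
The commit itself only edits the `article` string; a minimal sketch of how such a string is typically wired into the surrounding Gradio app, where the translate function, model id, and interface arguments are assumptions rather than the commit's actual code:

```python
import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = 'hackathon-pln-es/t5-small-spanish-nahuatl'  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

def translate(text):
    # Prepend the assumed task prefix, then decode a short translation.
    inputs = tokenizer('translate Spanish to Nahuatl: ' + text,
                       return_tensors='pt')
    outputs = model.generate(inputs.input_ids, max_length=64)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# `article` is the markdown string updated in this commit; Gradio renders
# it below the interface.
gr.Interface(fn=translate, inputs='text', outputs='text',
             article=article).launch()
```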