import os
import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

article='''
# Spanish Nahuatl Automatic Translation
Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural machine translation model for it is hard due to the lack of structured data. The most popular datasets, the Axolotl parallel corpus and the bible-corpus, contain only ~16,000 and ~7,000 samples respectively. Moreover, there are multiple variants of Nahuatl, which makes the task even harder: a single word from the Axolotl corpus can be found written in more than three different ways. Therefore, in this work we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first teach the multilingual model Spanish using English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report chrF and BLEU results.

## Motivation

One of the United Nations Sustainable Development Goals is ["Reduced Inequalities"](https://www.un.org/sustainabledevelopment/inequality/). Language is one of the most powerful tools we have for sharing knowledge and experience, yet most of the progress made in important areas such as technology, education, human rights, law, and news remains out of reach for many communities because of the lack of resources in their languages. We hope this work becomes a platform that helps reduce that inequality, bringing Nahuatl speakers closer to the resources they need to thrive, and inviting them to share with us their valuable knowledge, customs, and way of living.


## Model description
This model is a T5 Transformer ([t5-small](https://huggingface.co./t5-small)) fine-tuned on Spanish and Nahuatl sentences collected from the web. The dataset is normalized using the 'sep' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).
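
The normalization step can be sketched with py-elotl. A minimal example, assuming py-elotl exposes a `Normalizer` class with a `normalize` method (check the project README for the exact API):

```python
# A minimal sketch of the 'sep' orthographic normalization (assumed API)
from elotl.nahuatl.orthography import Normalizer

normalizer = Normalizer('sep')
print(normalizer.normalize('cualli tonalli'))  # prints the normalized spelling
```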


## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')

model.eval()
sentence = 'muchas flores son blancas'
# The task prefix tells the model which translation direction to perform
input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence, return_tensors='pt').input_ids
outputs = model.generate(input_ids)
# outputs = miak xochitl istak
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
```
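
`generate` defaults to greedy decoding. For longer inputs, beam search may produce better translations; a hedged variant of the call above (the decoding parameters are illustrative, not the settings used by this Space):

```python
# Illustrative: beam search decoding, reusing input_ids from the snippet above
outputs = model.generate(input_ids, max_length=512, num_beams=4, early_stopping=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```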

## Approach
### Dataset
Since the Axolotl corpus contains misalignments, we select only the best-aligned samples (12,207 samples). We also use the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821 samples).

| Axolotl best aligned books                            | 
|:-----------------------------------------------------:|
| Anales de Tlatelolco                                  | 
| Diario                                                |  
| Documentos nauas de la Ciudad de México del siglo XVI |  
| Historia de México narrada en náhuatl y español       |  
| La tinta negra y roja (antología de poesía náhuatl)   |  
| Memorial Breve (Libro las ocho relaciones)            |  
| Método auto-didáctico náhuatl-español                 |  
| Nican Mopohua                                         | 
| Quinta Relación (Libro las ocho relaciones)           |   
| Recetario Nahua de Milpa Alta D.F                     | 
| Testimonios de la antigua palabra                     |
| Trece Poetas del Mundo Azteca                         |
| Una tortillita nomás - Se taxkaltsin saj              |
| Vida económica de Tenochtitlan                        |

Also, to increase the amount of data, we collected 3,000 extra samples from the web.

### Model and training
We employ two training stages using a multilingual T5-small. We chose this model because it can handle different vocabularies and prefixes, and it is pre-trained on multiple tasks and languages (French, Romanian, English, German).

### Training-stage 1 (learning Spanish)
In training stage 1 we first introduce Spanish to the model. The goal is to learn a new, data-rich language (Spanish) without losing the knowledge the model has already acquired. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs, and train the model until convergence, adding the prefix "Translate Spanish to English: ".
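
As a sketch of the prefix strategy, each training pair is serialized by prepending the task prefix to the source text and tokenizing the target as labels (the function below is illustrative, not the original preprocessing code):

```python
# Illustrative prefix-based preprocessing for seq2seq fine-tuning
def preprocess(pair, tokenizer, prefix='Translate Spanish to English: '):
    source, target = pair  # (Spanish, English) text pair
    model_inputs = tokenizer(prefix + source, truncation=True)
    model_inputs['labels'] = tokenizer(target, truncation=True)['input_ids']
    return model_inputs
```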

### Training-stage 2 (learning Nahuatl)
We use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the number of Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset to our training set. This two-task training avoids overfitting and makes the model more robust.
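
A minimal sketch of that mixing step (the in-memory lists below are hypothetical stand-ins for the real corpora):

```python
import random

# Hypothetical pair lists; in practice these come from the Axolotl/bible/web
# data and the Anki corpus, each example carrying its own task prefix
nahuatl_pairs = [('translate Spanish to Nahuatl: muchas flores son blancas', 'miak xochitl istak')]
anki_pairs = [('Translate Spanish to English: te amo', 'I love you')] * 20000

random.seed(0)
mixed = nahuatl_pairs + random.sample(anki_pairs, 20000)
random.shuffle(mixed)  # interleave both tasks across batches
```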

### Training setup
We train the models on the same datasets for 660k steps, using a batch size of 16 and a learning rate of 2e-5.
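
With the Hugging Face `Trainer` API, that setup roughly corresponds to the following arguments (a sketch of the reported hyperparameters, not the exact training script):

```python
from transformers import Seq2SeqTrainingArguments

# Reported hyperparameters; all remaining arguments keep their defaults
training_args = Seq2SeqTrainingArguments(
    output_dir='t5-small-spanish-nahuatl',
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    max_steps=660_000,
)
```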


## Evaluation results
For a fair comparison, the models are evaluated on the same 505 Nahuatl validation sentences. We report the results using the Hugging Face chrF and SacreBLEU metrics:

| English-Spanish pretraining  | Validation loss | BLEU | chrF   |
|:----------------------------:|:---------------:|:----:|:------:|
| False                        | 1.34            | 6.17 | 26.96  |
| True                         | 1.31            | 6.18 | 28.21  |

The English-Spanish pretraining improves BLEU and chrF, and leads to faster convergence. The evaluation can be reproduced with the [eval.ipynb](https://github.com/milmor/spanish-nahuatl-translation/blob/main/eval.ipynb) notebook.
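
A minimal sketch of the metric computation, using the Hugging Face `evaluate` library (the notebook may load the metrics differently):

```python
import evaluate

chrf = evaluate.load('chrf')
sacrebleu = evaluate.load('sacrebleu')

predictions = ['miak xochitl istak']
references = [['miak xochitl istak']]  # one or more references per prediction

print(chrf.compute(predictions=predictions, references=references)['score'])
print(sacrebleu.compute(predictions=predictions, references=references)['score'])
```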

## Team members
- Emilio Alejandro Morales [(milmor)](https://huggingface.co./milmor)
- Rodrigo Martínez Arzate  [(rockdrigoma)](https://huggingface.co./rockdrigoma)
- Luis Armando Mercado [(luisarmando)](https://huggingface.co./luisarmando)
- Jacobo del Valle [(jjdv)](https://huggingface.co./jjdv)

## References
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

- Ximena Gutierrez-Vasques, Gerardo Sierra, and Isaac Hernandez. 2016. Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).

'''

model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')

def predict(text):
    # Prepend the task prefix so the model translates Spanish to Nahuatl
    input_ids = tokenizer('translate Spanish to Nahuatl: ' + text, return_tensors='pt').input_ids
    outputs = model.generate(input_ids, max_length=512)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
  
# Flagging token read from the environment (typically set as a Space secret)
HF_TOKEN = os.getenv('spanish-nahuatl-flagging')

hf_writer = gr.HuggingFaceDatasetSaver(HF_TOKEN, "spanish-nahuatl-flagging")

gr.Interface(
    fn=predict,
    inputs=gr.inputs.Textbox(lines=1, label="Input Text in Spanish"),
    outputs=[
        gr.outputs.Textbox(label="Translated text in Nahuatl"),
    ],
    theme="peach",
    title='🌽 Spanish to Nahuatl Automatic Translation',
    description='Insert your Spanish text in the left text box and you will get its Nahuatl translation in the right text box',
    examples=[
        'conejo',
        'estrella',
        'Muchos perros son blancos',
        'te amo',
        'quiero comer',
        'esto se llama agua',
        'Mi hermano es un ajolote',
        'mi abuelo se llama Juan',
        'El pueblo del ajolote',
        'te amo con todo mi corazón'],
    article=article,
    allow_flagging="manual",
    flagging_options=["right translation", "wrong translation", "error", "other"],
    flagging_callback=hf_writer,
).launch(enable_queue=True)