dominguesm committed on
Commit
de6f73b
·
1 Parent(s): 486e2e9

README update

Files changed (2)
  1. README.md +179 -4
  2. README_ptbr.md +221 -0
README.md CHANGED
@@ -1,7 +1,7 @@
1
  ---
2
  language:
3
  - yrl
4
- license: cc-by-4.0
5
  pipeline_tag: token-classification
6
  tags:
7
  - named-entity-recognition
@@ -38,12 +38,187 @@ widget:
38
  - text: "Asuí kwá mukũi apigawa-itá aintá usemu kaá kití aintá upurakí arama balata, asuí mairamé aintá usika ana iwitera rupitá-pe, ape aintá umaã siya kumã iwa-itá."
39
  ---
40
 
41
  <p align="center">
42
  <img width="350" alt="Canarim Logo" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/canarim-yrl-nbg.png">
43
  </p>
44
 
45
- <hr>
46
 
47
- # Canarim-Bert-PosTag-Nheengatu
48
 
49
- WIP
1
  ---
2
  language:
3
  - yrl
4
+ license: cc-by-nc-4.0
5
  pipeline_tag: token-classification
6
  tags:
7
  - named-entity-recognition
 
38
  - text: "Asuí kwá mukũi apigawa-itá aintá usemu kaá kití aintá upurakí arama balata, asuí mairamé aintá usika ana iwitera rupitá-pe, ape aintá umaã siya kumã iwa-itá."
39
  ---
40
 
41
+ # Canarim-Bert-PosTag-Nheengatu
42
+
43
  <p align="center">
44
  <img width="350" alt="Canarim Logo" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/canarim-yrl-nbg.png">
45
  </p>
46
 
47
+ <br/>
48
 
49
+ ## About
50
+
51
+ The `canarim-bert-posTag-nheengatu` model is a part-of-speech tagger for the Nheengatu language, trained on the `UD_Nheengatu-CompLin` dataset available on [GitHub](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/). It is fine-tuned from the [`Canarim-Bert-Nheengatu`](https://huggingface.co/dominguesm/canarim-bert-nheengatu) model and uses its tokenizer.
52
+
53
+ ## Supported Tags
54
+
55
+ The model can identify the following grammatical classes:
56
+
57
+ |**tag**|**abbreviation in glossary**|**expansion of abbreviation**|
58
+ |-------|-----------------------------|-----------------------------|
59
+ |ADJ|adj.|1st class adjective|
60
+ |ADP|posp.|postposition|
61
+ |ADV|adv.|adverb|
62
+ |AUX|aux.|auxiliary|
63
+ |CCONJ|cconj.|coordinating conjunction|
64
+ |DET|det.|determiner|
65
+ |INTJ|interj.|interjection|
66
+ |NOUN|n.|1st class noun|
67
+ |NUM|num.|numeral|
68
+ |PART|part.|particle|
69
+ |PRON|pron.|1st class pronoun|
70
+ |PROPN|prop.|proper noun|
71
+ |PUNCT|punct.|punctuation|
72
+ |SCONJ|sconj.|subordinating conjunction|
73
+ |VERB|v.|1st class verb|
74
+
75
+ ## Training
76
+
77
+ ### Dataset
78
+
79
+ The model was trained on the [`UD_Nheengatu-CompLin`](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/) dataset, split 80/10/10 into training, evaluation, and test sets, respectively.
80
+
81
+
82
+ ```
83
+ DatasetDict({
84
+     train: Dataset({
85
+         features: ['id', 'tokens', 'pos_tags', 'text'],
86
+         num_rows: 1068
87
+     })
88
+     test: Dataset({
89
+         features: ['id', 'tokens', 'pos_tags', 'text'],
90
+         num_rows: 134
91
+     })
92
+     eval: Dataset({
93
+         features: ['id', 'tokens', 'pos_tags', 'text'],
94
+         num_rows: 134
95
+     })
96
+ })
97
+ ```
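
The split sizes shown above are consistent with holding out 20% of the 1,336 sentences (rounded up) and then halving the held-out part. The sketch below reproduces only that arithmetic. The helper name `split_sizes` and the two-stage procedure are assumptions for illustration; the splitting code actually used is not part of this README.

```python
import math


def split_sizes(n_rows: int, holdout_fraction: float = 0.2) -> tuple[int, int, int]:
    """Return (train, eval, test) sizes for a two-stage 80/10/10 split."""
    # Stage 1: hold out a fraction of the rows, rounding the held-out count up.
    holdout = math.ceil(n_rows * holdout_fraction)
    # Stage 2: divide the held-out rows evenly between eval and test.
    n_eval = holdout // 2
    n_test = holdout - n_eval
    return n_rows - holdout, n_eval, n_test


total_rows = 1068 + 134 + 134  # rows reported in the DatasetDict above
print(split_sizes(total_rows))
```

Because the held-out count is rounded up, the training set ends up slightly below an exact 80% share.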
98
+
99
+ ### Hyperparameters
100
+
101
+ The hyperparameters used for training were:
102
+
103
+ * `learning_rate`: 3e-4
104
+ * `train_batch_size`: 16
105
+ * `eval_batch_size`: 32
106
+ * `gradient_accumulation_steps`: 1
107
+ * `weight_decay`: 0.01
108
+ * `num_train_epochs`: 10
109
+
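
To give a sense of scale, these settings imply a fairly short training run. The sketch below derives the batch and update counts from the values listed above. It assumes a single device and that the last partial batch of each pass is kept; neither is stated in this README, and the helper name `batches_per_pass` is hypothetical.

```python
import math

# Values copied from the hyperparameter list above.
hyperparameters = {
    "learning_rate": 3e-4,
    "train_batch_size": 16,
    "eval_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "weight_decay": 0.01,
    "num_train_epochs": 10,
}

TRAIN_ROWS = 1068  # size of the train split
EVAL_ROWS = 134    # size of the eval split


def batches_per_pass(rows: int, batch_size: int) -> int:
    """Number of batches in one pass over the data, keeping the last partial batch."""
    return math.ceil(rows / batch_size)


train_batches = batches_per_pass(TRAIN_ROWS, hyperparameters["train_batch_size"])
# Assumption: one optimizer update per `gradient_accumulation_steps` batches.
updates_per_epoch = math.ceil(train_batches / hyperparameters["gradient_accumulation_steps"])
total_updates = updates_per_epoch * hyperparameters["num_train_epochs"]
eval_batches = batches_per_pass(EVAL_ROWS, hyperparameters["eval_batch_size"])

print(train_batches, total_updates, eval_batches)
```

Under these assumptions an evaluation pass takes five batches, which agrees with the reported evaluation throughput (25.555 steps per second over 0.1957 seconds is roughly 5 steps).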
110
+ ### Results
111
+
112
+ The training and validation loss over the steps can be seen below:
113
+
114
+ <p align="center">
115
+ <img width="600" alt="Train Loss" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-train-loss.png">
116
+ </p>
117
+
118
+ <p align="center">
119
+ <img width="600" alt="Eval Loss" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-eval-loss.png">
120
+ </p>
121
+
122
+ The model's results on the evaluation set can be viewed below:
123
+
124
+ ```
125
+ {
126
+ 'eval_loss': 0.5337784886360168,
127
+ 'eval_precision': 0.913735899137359,
128
+ 'eval_recall': 0.913735899137359,
129
+ 'eval_f1': 0.913735899137359,
130
+ 'eval_accuracy': 0.913735899137359,
131
+ 'eval_runtime': 0.1957,
132
+ 'eval_samples_per_second': 684.883,
133
+ 'eval_steps_per_second': 25.555,
134
+ 'epoch': 10.0
135
+ }
136
+ ```
137
+
138
+ ### Metrics
139
+
140
+ The model's evaluation metrics on the test set can be viewed below:
141
+
142
+ ```
143
+               precision    recall  f1-score   support
144
+
145
+          ADJ     0.7895    0.6522    0.7143        23
146
+          ADP     0.9355    0.9158    0.9255        95
147
+          ADV     0.8261    0.8172    0.8216        93
148
+          AUX     0.9444    0.9189    0.9315        37
149
+        CCONJ     0.7778    0.8750    0.8235         8
150
+          DET     0.8776    0.9149    0.8958        47
151
+         INTJ     0.5000    0.5000    0.5000         4
152
+         NOUN     0.9257    0.9222    0.9239       270
153
+          NUM     1.0000    0.6667    0.8000         6
154
+         PART     0.9775    0.9062    0.9405        96
155
+         PRON     0.9568    1.0000    0.9779       155
156
+        PROPN     0.6429    0.4286    0.5143        21
157
+        PUNCT     0.9963    1.0000    0.9981       267
158
+        SCONJ     0.8000    0.7500    0.7742        32
159
+         VERB     0.8651    0.9347    0.8986       199
160
+
161
+    micro avg     0.9202    0.9202    0.9202      1353
162
+    macro avg     0.8543    0.8135    0.8293      1353
163
+ weighted avg     0.9191    0.9202    0.9187      1353
164
+ ```
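
The summary rows can be recomputed from the per-tag rows, which helps when reading them. The macro average is the unweighted mean over the 15 tags, so rare tags such as INTJ and PROPN pull it down. The micro average equals token accuracy, because with exactly one tag per token every error counts once as a false positive and once as a false negative. The sketch below is a sanity check on the rounded numbers in the report, not the original evaluation code; the helper name `macro_average` is hypothetical.

```python
# (precision, recall, f1, support) per tag, copied from the report above.
report = {
    "ADJ": (0.7895, 0.6522, 0.7143, 23),
    "ADP": (0.9355, 0.9158, 0.9255, 95),
    "ADV": (0.8261, 0.8172, 0.8216, 93),
    "AUX": (0.9444, 0.9189, 0.9315, 37),
    "CCONJ": (0.7778, 0.8750, 0.8235, 8),
    "DET": (0.8776, 0.9149, 0.8958, 47),
    "INTJ": (0.5000, 0.5000, 0.5000, 4),
    "NOUN": (0.9257, 0.9222, 0.9239, 270),
    "NUM": (1.0000, 0.6667, 0.8000, 6),
    "PART": (0.9775, 0.9062, 0.9405, 96),
    "PRON": (0.9568, 1.0000, 0.9779, 155),
    "PROPN": (0.6429, 0.4286, 0.5143, 21),
    "PUNCT": (0.9963, 1.0000, 0.9981, 267),
    "SCONJ": (0.8000, 0.7500, 0.7742, 32),
    "VERB": (0.8651, 0.9347, 0.8986, 199),
}


def macro_average(column: int) -> float:
    """Unweighted mean of one report column (0=precision, 1=recall, 2=f1)."""
    return sum(row[column] for row in report.values()) / len(report)


total_tokens = sum(row[3] for row in report.values())
# Recall times support recovers the number of correctly tagged tokens per tag.
correct_tokens = sum(round(row[1] * row[3]) for row in report.values())
accuracy = correct_tokens / total_tokens

print(round(macro_average(0), 4), round(macro_average(1), 4), round(macro_average(2), 4))
print(total_tokens, correct_tokens, round(accuracy, 4))
```

The recomputed macro recall and macro F1 match the values in the `model-index` metadata (81.35 and 82.93), and the accuracy matches the 92.02 reported there.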
165
+
166
+ <br/>
167
+
168
+ <p align="center">
169
+ <img width="600" alt="Canarim BERT Nheengatu - POSTAG - Confusion Matrix" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-confusion-matrix.png">
170
+ </p>
171
+
172
+ ## Usage
173
+
174
+ This model works with the standard [transformers](https://github.com/huggingface/transformers) API. Install the library, then load the model through a token-classification pipeline:
175
+
176
+
177
+ ```python
178
+ from transformers import pipeline
179
+
180
+ model_name = "dominguesm/canarim-bert-postag-nheengatu"
181
+
182
+ pipe = pipeline("ner", model=model_name)
183
+
184
+ pipe("Yamunhã timbiú, yapinaitika, yamunhã kaxirí.", aggregation_strategy="average")
185
+ ```
186
+
187
+ The result will be:
188
+
189
+ ```json
190
+ [
191
+ {"entity_group": "VERB", "score": 0.999668, "word": "Yamunhã", "start": 0, "end": 7},
192
+ {"entity_group": "NOUN", "score": 0.99986947, "word": "timbiú", "start": 8, "end": 14},
193
+ {"entity_group": "PUNCT", "score": 0.99993193, "word": ",", "start": 14, "end": 15},
194
+ {"entity_group": "VERB", "score": 0.9995308, "word": "yapinaitika", "start": 16, "end": 27},
195
+ {"entity_group": "PUNCT", "score": 0.9999416, "word": ",", "start": 27, "end": 28},
196
+ {"entity_group": "VERB", "score": 0.99955815, "word": "yamunhã", "start": 29, "end": 36},
197
+ {"entity_group": "NOUN", "score": 0.9998684, "word": "kaxirí", "start": 37, "end": 43},
198
+ {"entity_group": "PUNCT", "score": 0.99997807, "word": ".", "start": 43, "end": 44}
199
+ ]
200
+ ```
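
For downstream use, the list of dictionaries is often reduced to plain `(token, tag)` pairs. The sketch below does this offline on the output shown above, so it does not need the model. It slices the original sentence with `start`/`end` rather than reading `word`, on the assumption that the offsets are the safer handle on the input text. The helper name `to_tagged_tokens` is hypothetical.

```python
import unicodedata

# Normalize to NFC so that accented letters count as single characters
# and the character offsets below line up with the text.
sentence = unicodedata.normalize("NFC", "Yamunhã timbiú, yapinaitika, yamunhã kaxirí.")

# Pipeline output copied from the example above.
results = [
    {"entity_group": "VERB", "score": 0.999668, "word": "Yamunhã", "start": 0, "end": 7},
    {"entity_group": "NOUN", "score": 0.99986947, "word": "timbiú", "start": 8, "end": 14},
    {"entity_group": "PUNCT", "score": 0.99993193, "word": ",", "start": 14, "end": 15},
    {"entity_group": "VERB", "score": 0.9995308, "word": "yapinaitika", "start": 16, "end": 27},
    {"entity_group": "PUNCT", "score": 0.9999416, "word": ",", "start": 27, "end": 28},
    {"entity_group": "VERB", "score": 0.99955815, "word": "yamunhã", "start": 29, "end": 36},
    {"entity_group": "NOUN", "score": 0.9998684, "word": "kaxirí", "start": 37, "end": 43},
    {"entity_group": "PUNCT", "score": 0.99997807, "word": ".", "start": 43, "end": 44},
]


def to_tagged_tokens(text: str, entities: list[dict]) -> list[tuple[str, str]]:
    """Pair each span of the original text with its predicted tag."""
    return [(text[e["start"]:e["end"]], e["entity_group"]) for e in entities]


tagged = to_tagged_tokens(sentence, results)
print(" ".join(f"{token}/{tag}" for token, tag in tagged))
```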
201
+
202
+ ## License
203
+
204
+ The license of this model follows that of the dataset used for training, which is [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). For more information, please visit the [dataset repository](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/tree/master).
205
+
206
+
207
+ ## References
208
 
209
+ ```bibtex
210
+ @inproceedings{stil,
211
+ author = {Leonel de Alencar},
212
+ title = {Yauti: A Tool for Morphosyntactic Analysis of Nheengatu within the Universal Dependencies Framework},
213
+ booktitle = {Anais do XIV Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana},
214
+ location = {Belo Horizonte/MG},
215
+ year = {2023},
216
+ keywords = {},
217
+ issn = {0000-0000},
218
+ pages = {135--145},
219
+ publisher = {SBC},
220
+ address = {Porto Alegre, RS, Brasil},
221
+ doi = {10.5753/stil.2023.234131},
222
+ url = {https://sol.sbc.org.br/index.php/stil/article/view/25445}
223
+ }
224
+ ```
README_ptbr.md ADDED
@@ -0,0 +1,221 @@
1
+ ---
2
+ language:
3
+ - yrl
4
+ license: cc-by-nc-4.0
5
+ pipeline_tag: token-classification
6
+ tags:
7
+ - named-entity-recognition
8
+ - Transformer
9
+ - pytorch
10
+ - bert
11
+ - nheengatu
12
+ metrics:
13
+ - f1
14
+ - precision
15
+ - recall
16
+ model-index:
17
+ - name: canarim-bert-postag-nheengatu
18
+ results:
19
+ - task:
20
+ type: named-entity-recognition
21
+ dataset:
22
+ type: UD_Nheengatu-CompLin
23
+ name: UD Nheengatu CompLin
24
+ metrics:
25
+ - type: f1
26
+ value: 82.93
27
+ name: F1 Score
28
+ - type: accuracy
29
+ value: 92.02
30
+ name: Accuracy
31
+ - type: recall
32
+ value: 81.35
33
+ name: Recall
34
+ widget:
35
+ - text: "Apigawa i paya waá umurari iké, sera José."
36
+ - text: "Asú apagari nhaã apigawa supé."
37
+ - text: "― Taukwáu ra."
38
+ - text: "Asuí kwá mukũi apigawa-itá aintá usemu kaá kití aintá upurakí arama balata, asuí mairamé aintá usika ana iwitera rupitá-pe, ape aintá umaã siya kumã iwa-itá."
39
+ ---
40
+
41
+ # Canarim-Bert-PosTag-Nheengatu
42
+
43
+ <p align="center">
44
+ <img width="350" alt="Canarim Logo" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/canarim-yrl-nbg.png">
45
+ </p>
46
+
47
+ <br/>
48
+
49
+ ## About
50
+
51
+ The `canarim-bert-posTag-nheengatu` model is a part-of-speech tagger for the Nheengatu language, trained on the `UD_Nheengatu-CompLin` dataset available on [GitHub](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/). It is fine-tuned from the [`Canarim-Bert-Nheengatu`](https://huggingface.co/dominguesm/canarim-bert-nheengatu) model and uses its tokenizer.
52
+
53
+ ## Supported Tags
54
+
55
+ The model can identify the following grammatical classes:
56
+
57
+ |**tag**|**abbreviation in glossary**|**expansion of abbreviation**|
58
+ |-------|-----------------------------|-----------------------------|
59
+ |ADJ|adj.|1st class adjective|
60
+ |ADP|posp.|postposition|
61
+ |ADV|adv.|adverb|
62
+ |AUX|aux.|auxiliary|
63
+ |CCONJ|cconj.|coordinating conjunction|
64
+ |DET|det.|determiner|
65
+ |INTJ|interj.|interjection|
66
+ |NOUN|n.|1st class noun|
67
+ |NUM|num.|numeral|
68
+ |PART|part.|particle|
69
+ |PRON|pron.|1st class pronoun|
70
+ |PROPN|prop.|proper noun|
71
+ |PUNCT|punct.|punctuation|
72
+ |SCONJ|sconj.|subordinating conjunction|
73
+ |VERB|v.|1st class verb|
74
+
75
+ ## Training
76
+
77
+ ### Dataset
78
+
79
+ The model was trained on the [`UD_Nheengatu-CompLin`](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/) dataset, split 80/10/10 into training, evaluation, and test sets, respectively.
80
+
81
+ ```
82
+ DatasetDict({
83
+     train: Dataset({
84
+         features: ['id', 'tokens', 'pos_tags', 'text'],
85
+         num_rows: 1068
86
+     })
87
+     test: Dataset({
88
+         features: ['id', 'tokens', 'pos_tags', 'text'],
89
+         num_rows: 134
90
+     })
91
+     eval: Dataset({
92
+         features: ['id', 'tokens', 'pos_tags', 'text'],
93
+         num_rows: 134
94
+     })
95
+ })
96
+ ```
97
+
98
+ ### Hyperparameters
99
+
100
+ The hyperparameters used for training were:
101
+
102
+ * `learning_rate`: 3e-4
103
+ * `train_batch_size`: 16
104
+ * `eval_batch_size`: 32
105
+ * `gradient_accumulation_steps`: 1
106
+ * `weight_decay`: 0.01
107
+ * `num_train_epochs`: 10
108
+
109
+ ### Results
110
+
111
+ The training and validation loss over the epochs can be seen below:
112
+
113
+ <p align="center">
114
+ <img width="600" alt="Train Loss" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-train-loss.png">
115
+ </p>
116
+
117
+ <p align="center">
118
+ <img width="600" alt="Eval Loss" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-eval-loss.png">
119
+ </p>
120
+
121
+ The model's results on the evaluation set can be viewed below:
122
+
123
+ ```
124
+ {
125
+ 'eval_loss': 0.5337784886360168,
126
+ 'eval_precision': 0.913735899137359,
127
+ 'eval_recall': 0.913735899137359,
128
+ 'eval_f1': 0.913735899137359,
129
+ 'eval_accuracy': 0.913735899137359,
130
+ 'eval_runtime': 0.1957,
131
+ 'eval_samples_per_second': 684.883,
132
+ 'eval_steps_per_second': 25.555,
133
+ 'epoch': 10.0
134
+ }
135
+ ```
136
+
137
+ ### Metrics
138
+
139
+ The model's evaluation metrics on the test set can be viewed below:
140
+
141
+ ```
142
+               precision    recall  f1-score   support
143
+
144
+          ADJ     0.7895    0.6522    0.7143        23
145
+          ADP     0.9355    0.9158    0.9255        95
146
+          ADV     0.8261    0.8172    0.8216        93
147
+          AUX     0.9444    0.9189    0.9315        37
148
+        CCONJ     0.7778    0.8750    0.8235         8
149
+          DET     0.8776    0.9149    0.8958        47
150
+         INTJ     0.5000    0.5000    0.5000         4
151
+         NOUN     0.9257    0.9222    0.9239       270
152
+          NUM     1.0000    0.6667    0.8000         6
153
+         PART     0.9775    0.9062    0.9405        96
154
+         PRON     0.9568    1.0000    0.9779       155
155
+        PROPN     0.6429    0.4286    0.5143        21
156
+        PUNCT     0.9963    1.0000    0.9981       267
157
+        SCONJ     0.8000    0.7500    0.7742        32
158
+         VERB     0.8651    0.9347    0.8986       199
159
+
160
+    micro avg     0.9202    0.9202    0.9202      1353
161
+    macro avg     0.8543    0.8135    0.8293      1353
162
+ weighted avg     0.9191    0.9202    0.9187      1353
163
+ ```
164
+
165
+ <br/>
166
+
167
+ <p align="center">
168
+ <img width="600" alt="Canarim BERT Nheengatu - POSTAG - Confusion Matrix" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-confusion-matrix.png">
169
+ </p>
170
+
171
+ ## Usage
172
+
173
+ This model works with the standard [transformers](https://github.com/huggingface/transformers) API. Install the library, then load the model through a token-classification pipeline:
174
+
175
+ ```python
176
+ from transformers import pipeline
177
+
178
+ model_name = "dominguesm/canarim-bert-postag-nheengatu"
179
+
180
+ pipe = pipeline("ner", model=model_name)
181
+
182
+ pipe("Yamunhã timbiú, yapinaitika, yamunhã kaxirí.", aggregation_strategy="average")
183
+ ```
184
+
185
+ The result will be:
186
+
187
+ ```json
188
+ [
189
+ {"entity_group": "VERB", "score": 0.999668, "word": "Yamunhã", "start": 0, "end": 7},
190
+ {"entity_group": "NOUN", "score": 0.99986947, "word": "timbiú", "start": 8, "end": 14},
191
+ {"entity_group": "PUNCT", "score": 0.99993193, "word": ",", "start": 14, "end": 15},
192
+ {"entity_group": "VERB", "score": 0.9995308, "word": "yapinaitika", "start": 16, "end": 27},
193
+ {"entity_group": "PUNCT", "score": 0.9999416, "word": ",", "start": 27, "end": 28},
194
+ {"entity_group": "VERB", "score": 0.99955815, "word": "yamunhã", "start": 29, "end": 36},
195
+ {"entity_group": "NOUN", "score": 0.9998684, "word": "kaxirí", "start": 37, "end": 43},
196
+ {"entity_group": "PUNCT", "score": 0.99997807, "word": ".", "start": 43, "end": 44}
197
+ ]
198
+ ```
199
+
200
+ ## License
201
+
202
+ The license of this model follows that of the dataset used for training, which is [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). For more information, please visit the [dataset repository](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/tree/master).
203
+
204
+ ## References
205
+
206
+ ```bibtex
207
+ @inproceedings{stil,
208
+ author = {Leonel de Alencar},
209
+ title = {Yauti: A Tool for Morphosyntactic Analysis of Nheengatu within the Universal Dependencies Framework},
210
+ booktitle = {Anais do XIV Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana},
211
+ location = {Belo Horizonte/MG},
212
+ year = {2023},
213
+ keywords = {},
214
+ issn = {0000-0000},
215
+ pages = {135--145},
216
+ publisher = {SBC},
217
+ address = {Porto Alegre, RS, Brasil},
218
+ doi = {10.5753/stil.2023.234131},
219
+ url = {https://sol.sbc.org.br/index.php/stil/article/view/25445}
220
+ }
221
+ ```