gonzalez-agirre committed on
Commit
edf9c48
1 Parent(s): 053b79e

Update README.md

Files changed (1)
  1. README.md +76 -13
README.md CHANGED
@@ -1,26 +1,53 @@
  ---
  language:
  - es
  license: apache-2.0
  tags:
  - "national library of spain"
  - "spanish"
  - "bne"
  - "capitel"
  - "pos"
  datasets:
  - "bne"
- - "capitel"
  metrics:
  - "f1"
  widget:
  - text: "Festival de San Sebastián: Johnny Depp recibirá el premio Donostia en pleno rifirrafe judicial con Amber Heard"
  - text: "El alcalde de Vigo, Abel Caballero, ha comenzado a colocar las luces de Navidad en agosto."
  - text: "Gracias a los datos de la BNE, se ha podido lograr este modelo del lenguaje."
- - text: "El Tribunal Superior de Justicia se pronunció ayer: \"Hay base legal dentro del marco jurídico actual\"."
- inference:
-   parameters:
-     aggregation_strategy: "first"

  ---

@@ -35,7 +62,13 @@ inference:
  - [How to use](#how-to-use)
  - [Limitations and bias](#limitations-and-bias)
  - [Training](#training)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
  - [Author](#author)
  - [Contact information](#contact-information)
@@ -48,28 +81,56 @@ inference:
  </details>

  ## Model description
- RoBERTa-large-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) large model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.
-
- Original pre-trained model can be found here: https://huggingface.co/BSC-TeMU/roberta-large-bne


- ## Intended uses and limitations

  ## How to use

  ## Limitations and bias
  At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

  ## Training
-
  The dataset used is the one from the [CAPITEL competition at IberLEF 2020](https://sites.google.com/view/capitel2020) (sub-task 2).

- ## Evaluation
- F1 Score: 0.9851 (average of 5 runs).

- For evaluation details visit our [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish).


  ## Additional information

  ### Author
@@ -107,6 +168,8 @@ Intelligence (SEDIA) within the framework of the Plan-TL.},
  }

  ```
  ### Disclaimer

  The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
 
  ---
  language:
+
  - es
+
  license: apache-2.0
+
  tags:
+
  - "national library of spain"
+
  - "spanish"
+
  - "bne"
+
  - "capitel"
+
  - "pos"
+
  datasets:
+
  - "bne"
+
+ - "capitel"
+
  metrics:
+
  - "f1"
+
+ inference:
+   parameters:
+     aggregation_strategy: "first"
+
+ model-index:
+ - name: roberta-large-bne-capitel-pos
+   results:
+   - task:
+       type: token-classification
+     dataset:
+       type: pos
+       name: CAPITEL-POS
+     metrics:
+     - name: F1
+       type: f1
+       value: 0.986
+
  widget:
  - text: "Festival de San Sebastián: Johnny Depp recibirá el premio Donostia en pleno rifirrafe judicial con Amber Heard"
  - text: "El alcalde de Vigo, Abel Caballero, ha comenzado a colocar las luces de Navidad en agosto."
  - text: "Gracias a los datos de la BNE, se ha podido lograr este modelo del lenguaje."

  ---
 
  - [How to use](#how-to-use)
  - [Limitations and bias](#limitations-and-bias)
  - [Training](#training)
+   - [Training data](#training-data)
+   - [Training procedure](#training-procedure)
  - [Evaluation](#evaluation)
+   - [Variables and metrics](#variables-and-metrics)
+   - [Evaluation results](#evaluation-results)
  - [Additional information](#additional-information)
  - [Author](#author)
  - [Contact information](#contact-information)
 
  </details>

  ## Model description
+ **roberta-large-bne-capitel-pos** is a Part-of-Speech (POS) tagging model for the Spanish language, fine-tuned from the [roberta-large-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) large model pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text, processed for this work and compiled from the web crawlings performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.

+ ## Intended uses and limitations
+ The **roberta-large-bne-capitel-pos** model can be used to tag the parts of speech (POS) of a text. The model is limited by its training dataset and may not generalize well for all use cases.

  ## How to use

+ Here is how to use this model:
+
+ ```python
+ from transformers import pipeline
+ from pprint import pprint
+
+ nlp = pipeline("token-classification", model="PlanTL-GOB-ES/roberta-large-bne-capitel-pos")
+ example = "El alcalde de Vigo, Abel Caballero, ha comenzado a colocar las luces de Navidad en agosto."
+
+ pos_results = nlp(example)
+ pprint(pos_results)
+ ```
+
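As an illustrative sketch rather than part of this commit: the front matter above declares `aggregation_strategy: "first"` as an inference parameter, and the same option can be passed directly to the pipeline so that sub-word pieces are merged back into whole words. A minimal example, assuming the same model id and reusing the widget sentence:

```python
from transformers import pipeline

# aggregation_strategy="first" groups sub-word pieces into words; each word
# takes the tag predicted for its first piece (mirrors the widget configuration above).
nlp = pipeline(
    "token-classification",
    model="PlanTL-GOB-ES/roberta-large-bne-capitel-pos",
    aggregation_strategy="first",
)

for word in nlp("El alcalde de Vigo, Abel Caballero, ha comenzado a colocar las luces de Navidad en agosto."):
    print(word["word"], word["entity_group"], f"{word['score']:.3f}")
```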
  ## Limitations and bias
  At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

  ## Training
+ ### Training data
  The dataset used is the one from the [CAPITEL competition at IberLEF 2020](https://sites.google.com/view/capitel2020) (sub-task 2).

+ ### Training procedure
+ The model was trained with a batch size of 16 and a learning rate of 3e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric on the corresponding development set, and finally evaluated it on the test set.
+
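A rough sketch of how the reported hyperparameters (batch size 16, learning rate 3e-5, 5 epochs) would map onto a standard `transformers` `TrainingArguments` configuration; the output directory name is an assumption, and the dataset, metric, and checkpoint-selection plumbing are omitted:

```python
from transformers import TrainingArguments

# Only the values reported in the card are set explicitly (batch size 16,
# learning rate 3e-5, 5 epochs); output_dir is an assumed, illustrative name.
training_args = TrainingArguments(
    output_dir="roberta-large-bne-capitel-pos",
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=5,
)

print(training_args.per_device_train_batch_size,
      training_args.learning_rate,
      training_args.num_train_epochs)
```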
+ ## Evaluation
+
+ ### Variables and metrics
+ This model was fine-tuned maximizing the F1 score.

+ ### Evaluation results
+ We evaluated the **roberta-large-bne-capitel-pos** model on the CAPITEL-POS test set against standard multilingual and monolingual baselines:

+ | Model                         | CAPITEL-POS (F1) |
+ | ----------------------------- | :--------------: |
+ | roberta-large-bne-capitel-pos | **98.56**        |
+ | roberta-base-bne-capitel-pos  | 98.46            |
+ | BETO                          | 98.36            |
+ | mBERT                         | 98.39            |
+ | BERTIN                        | 98.47            |
+ | ELECTRA                       | 98.16            |
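The card does not specify exactly how the F1 scores above were computed; as a generic, illustrative reference only, a token-level micro-averaged F1 (which for single-label tagging coincides with accuracy) can be obtained like this, using made-up tag sequences rather than CAPITEL data:

```python
from sklearn.metrics import f1_score

# Toy gold/predicted POS tags; one wrong tag out of five.
gold = ["DET", "NOUN", "ADP", "PROPN", "PUNCT"]
pred = ["DET", "NOUN", "ADP", "ADJ", "PUNCT"]

print(f1_score(gold, pred, average="micro"))  # 0.8
```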
+
+ For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish).
+
  ## Additional information

  ### Author
 
  }

  ```
+
+
  ### Disclaimer

  The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.