mmarimon committed
Commit c1f93d4
1 Parent(s): e01cca2

Update README.md

Files changed (1)
  1. README.md +16 -31
README.md CHANGED
@@ -19,21 +19,20 @@ widget:
 <details>
 <summary>Click to expand</summary>

-- [Model Description](#model-description)
-- [Intended Uses and Limitations](#intended-use)
-- [How to Use](#how-to-use)
+- [Model description](#model-description)
+- [Intended uses and limitations](#intended-use)
+- [How to use](#how-to-use)
 - [Limitations and bias](#limitations-and-bias)
 - [Training](#training)
   - [Tokenization and model pretraining](#Tokenization-pretraining)
   - [Training corpora and preprocessing](#training-corpora-preprocessing)
-- [Evaluation and results](#evaluation)
-- [Additional Information](#additional-information)
-  - [Contact Information](#contact-information)
+- [Evaluation](#evaluation)
+- [Additional information](#additional-information)
+  - [Author](#author)
+  - [Contact information](#contact-information)
   - [Copyright](#copyright)
-  - [Licensing Information](#licensing-information)
+  - [Licensing information](#licensing-information)
   - [Funding](#funding)
-  - [Citation Information](#citation-information)
-  - [Contributions](#contributions)
   - [Disclaimer](#disclaimer)

 </details>
@@ -41,26 +40,18 @@ widget:
 ## Model description
 Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-SANIDAD/lm-biomedical-clinical-es) and read our [preprint](https://arxiv.org/abs/2109.03570) "_Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario._".

-## Intended uses & limitations
-
-The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section)
-
-However, the is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
+## Intended uses and limitations
+The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section). However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.


 ## How to use

 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
-
 tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
-
 model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
-
 from transformers import pipeline
-
 unmasker = pipeline('fill-mask', model="BSC-TeMU/roberta-base-biomedical-es")
-
 unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
 ```
 ```
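
The updated card notes above that, beyond fill-mask, the model is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification. The card itself does not include fine-tuning code; the sketch below is only a hypothetical illustration of how such a token-classification setup could be started with the `transformers` Trainer API (the label set, output directory and training arguments are placeholders, and the datasets are omitted).

```python
# Hypothetical sketch (not from the model card): preparing the checkpoint for NER fine-tuning.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)

model_name = "BSC-TeMU/roberta-base-biomedical-es"
labels = ["O", "B-ENT", "I-ENT"]  # placeholder BIO label set, not a real benchmark schema

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),  # adds a randomly initialised token-classification head
)

# A real run would pass tokenized train/eval datasets (e.g. a BIO-tagged NER corpus).
args = TrainingArguments(output_dir="roberta-bio-es-ner", num_train_epochs=3)
trainer = Trainer(model=model, args=args)  # supply train_dataset=... before calling trainer.train()
```
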
@@ -105,6 +96,7 @@ unmasker("El único antecedente personal a reseñar era la <mask> arterial.")

 This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
 **biomedical** corpus in Spanish collected from several sources (see next section).
+
 The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
 used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens. The pretraining consists of a masked language model training at the subword level following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. The training lasted a total of 48 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM, using Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.

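
Given that the card describes a byte-level BPE tokenizer with a 52,000-token vocabulary, a quick, illustrative way to inspect it (the example sentence is arbitrary and the printed values are expectations, not guaranteed output) is:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

print(tokenizer.vocab_size)  # should match the 52,000-token vocabulary described above
# See how a Spanish clinical sentence is segmented into byte-level BPE subwords
print(tokenizer.tokenize("El paciente presenta hipertensión arterial y disnea."))
```
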
@@ -139,8 +131,7 @@ The result is a medium-size biomedical corpus for Spanish composed of about 963M



-## Evaluation and results
-
+## Evaluation
 The model has been evaluated on the Named Entity Recognition (NER) using the following datasets:

 - [PharmaCoNER](https://zenodo.org/record/4270158): is a track on chemical and drug mention recognition from Spanish medical texts (for more info see: https://temu.bsc.es/pharmaconer/).
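
The evaluation scripts are not reproduced in this card; results on NER corpora such as PharmaCoNER are typically reported as entity-level F1 over BIO-tagged sequences, which can be computed with `seqeval` as in the sketch below (the tags are toy placeholders, not the benchmark's actual label set).

```python
# Illustrative only: entity-level F1 as commonly reported for NER benchmarks.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-DRUG", "I-DRUG", "O", "O", "B-DRUG"]]  # gold BIO tags (toy example)
y_pred = [["B-DRUG", "I-DRUG", "O", "O", "O"]]       # predicted BIO tags (toy example)

print(f1_score(y_true, y_pred))           # entity-level F1 on the toy example
print(classification_report(y_true, y_pred))
```
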
@@ -161,24 +152,23 @@ The evaluation results are compared against the [mBERT](https://huggingface.co/b

 ## Additional information

-### Contact Information
+### Author
+Text Mining Unit (TeMU) at the Barcelona Supercomputing Center ([email protected])

+### Contact information
 For further information, send an email to <[email protected]>

 ### Copyright
-
 Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

 ### Licensing information
-
 [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

 ### Funding
-
 This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.


-## Citation Information
+## Citation information
 If you use our models, please cite our latest preprint:

 ```bibtex
@@ -210,11 +200,6 @@ If you use our Medical Crawler corpus, please cite the preprint:
 ```


-### Contributions
-
-[N/A]
-
-
 ### Disclaimer

 <details>
 