Update README.md
README.md

<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-use)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Tokenization and model pretraining](#Tokenization-pretraining)
- [Training corpora and preprocessing](#training-corpora-preprocessing)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)
- [Author](#author)
- [Contact information](#contact-information)
- [Copyright](#copyright)
- [Licensing information](#licensing-information)
- [Funding](#funding)
- [Disclaimer](#disclaimer)

</details>

## Model description
Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-SANIDAD/lm-biomedical-clinical-es) and read our [preprint](https://arxiv.org/abs/2109.03570) "_Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario._".

## Intended uses and limitations
The model is ready to use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section). However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.

## How to use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and the pretrained masked-language model
tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

# Fill-mask pipeline: predicts the most likely tokens for the <mask> position
unmasker = pipeline('fill-mask', model="BSC-TeMU/roberta-base-biomedical-es")
unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
```
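
As noted above, the checkpoint is meant to be fine-tuned for downstream tasks such as Named Entity Recognition or Text Classification. The sketch below shows how such a fine-tune could start by loading the encoder with a token-classification head; the label set is purely illustrative and not an official label scheme for this model.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative BIO label set for a clinical NER task (hypothetical, for demonstration only)
labels = ["O", "B-ENFERMEDAD", "I-ENFERMEDAD"]

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForTokenClassification.from_pretrained(
    "BSC-TeMU/roberta-base-biomedical-es",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is randomly initialised; the encoder weights come from the
# pretrained biomedical checkpoint. Training then proceeds with the usual Trainer API
# or a custom PyTorch loop.
```

A text-classification fine-tune would follow the same pattern with `AutoModelForSequenceClassification`.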

### Tokenization and model pretraining

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a **biomedical** corpus in Spanish collected from several sources (see next section).

The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens. The pretraining consists of masked language model training at the subword level, following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. Training lasted a total of 48 hours on 16 NVIDIA V100 GPUs with 16 GB of memory each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
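
To see the byte-level BPE vocabulary described above in action, the short check below (illustrative only, not part of the original card; the example sentence is made up) loads the tokenizer and inspects the vocabulary size and the subword segmentation of a clinical sentence.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

# The vocabulary size should correspond to the 52,000 BPE tokens mentioned above
print(len(tokenizer))

# How a clinical sentence is segmented into subword units
print(tokenizer.tokenize("El paciente presenta hipertensión arterial y diabetes mellitus tipo 2."))
```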

## Evaluation
The model has been evaluated on Named Entity Recognition (NER) using the following datasets:

- [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more information, see https://temu.bsc.es/pharmaconer/).
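
NER evaluations of this kind are typically scored with entity-level F1 over BIO-tagged sequences. The snippet below is only an illustrative sketch of that metric using the `seqeval` package; it is not the evaluation code used for this model, and the tag sequences are invented.

```python
from seqeval.metrics import classification_report, f1_score

# Toy gold and predicted BIO tag sequences (invented, for illustration only)
y_true = [["O", "B-CHEMICAL", "I-CHEMICAL", "O"], ["B-DISEASE", "O"]]
y_pred = [["O", "B-CHEMICAL", "O", "O"], ["B-DISEASE", "O"]]

print(f1_score(y_true, y_pred))              # entity-level F1
print(classification_report(y_true, y_pred)) # per-entity-type breakdown
```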

## Additional information

### Author
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center ([email protected])

### Contact information
For further information, send an email to <[email protected]>

### Copyright
Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

### Licensing information
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

## Citation information
If you use our models, please cite our latest preprint:

```bibtex
% Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M.,
% Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for
% Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario.
% arXiv preprint arXiv:2109.03570.
```

### Disclaimer

<details>