Update README.md

README.md (changed)
# Biomedical language model for Spanish

## Table of contents

<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-use)
- [How to Use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
  - [Tokenization and model pretraining](#Tokenization-pretraining)
  - [Training corpora and preprocessing](#training-corpora-preprocessing)
- [Evaluation and results](#evaluation)
- [Additional Information](#additional-information)
  - [Contact Information](#contact-information)
  - [Copyright](#copyright)
  - [Licensing Information](#licensing-information)
  - [Funding](#funding)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
  - [Disclaimer](#disclaimer)

</details>

## Model description

Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-SANIDAD/lm-biomedical-clinical-es) and read our [preprint](https://arxiv.org/abs/2109.03570) "_Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario._".

## Intended uses & limitations

The model is ready to use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section).

However, the model is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.

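Beyond the fill-mask usage shown in the next section, a fine-tuning setup for a downstream task is sketched below. This is an illustrative example only: the label set, dataset preparation and hyperparameters are placeholders, not values released with the model.

```python
# Hypothetical sketch: fine-tuning this checkpoint for Named Entity Recognition
# (token classification) with the Hugging Face Trainer. Labels, batch size and
# learning rate are placeholders chosen for illustration.
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-DISEASE", "I-DISEASE"]  # placeholder label set

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForTokenClassification.from_pretrained(
    "BSC-TeMU/roberta-base-biomedical-es",
    num_labels=len(labels),
)

training_args = TrainingArguments(
    output_dir="roberta-biomedical-ner",  # placeholder output directory
    learning_rate=3e-5,                   # assumed value, not from this card
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# train_dataset / eval_dataset would be token-classification datasets tokenized
# with this tokenizer (e.g. prepared with the datasets library); omitted here.
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```

The same pattern applies to text classification via `AutoModelForSequenceClassification`.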
## How to use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and the masked-language model
tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

# Fill-mask pipeline for the same checkpoint
unmasker = pipeline('fill-mask', model="BSC-TeMU/roberta-base-biomedical-es")

unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
```
```
# Output
[
  {
    "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
    "score": 0.9855039715766907,
    "token": 3529,
    "token_str": " hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
    "score": 0.0039140828885138035,
    "token": 1945,
    "token_str": " diabetes"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
    "score": 0.002484665485098958,
    "token": 11483,
    "token_str": " hipotensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
    "score": 0.0023484621196985245,
    "token": 12238,
    "token_str": " Hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la presión arterial.",
    "score": 0.0008009297889657319,
    "token": 2267,
    "token_str": " presión"
  }
]
```

## Training

### Tokenization and model pretraining

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a **biomedical** corpus in Spanish collected from several sources (see next section).
The training corpus has been tokenized using a byte version of Byte-Pair Encoding (BPE) as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens. The pretraining consists of masked language model training at the subword level, following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. The training lasted a total of 48 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
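The exact tokenization and pretraining scripts are in the official repository linked above. Purely as an illustration of the byte-level BPE step described here, a tokenizer with the same 52,000-token vocabulary could be trained with the `tokenizers` library; the corpus file names and `min_frequency` below are assumptions, not values from this card.

```python
# Illustrative sketch only: train a byte-level BPE tokenizer like the one
# described above. Not the official tokenization script.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_part1.txt", "corpus_part2.txt"],  # placeholder corpus files
    vocab_size=52_000,
    min_frequency=2,  # assumption; the card does not report this setting
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa specials
)
tokenizer.save_model("tokenizer_out")  # writes vocab.json and merges.txt
```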

### Training corpora and preprocessing

The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers.
To obtain a high-quality training corpus, a cleaning pipeline with the following operations has been applied:

## Evaluation and results

| ICTUSnet | **88.12** - **85.56** - **90.83** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
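The table reports three scores per model and task. As an illustration only (not the official evaluation code), entity-level F1, precision and recall for NER predictions can be computed with the `seqeval` package; the tag sequences below are toy placeholders.

```python
# Toy example: scoring predicted BIO tag sequences with seqeval.
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-DISEASE", "I-DISEASE", "O"], ["O", "B-DRUG", "O"]]  # gold tags
y_pred = [["B-DISEASE", "I-DISEASE", "O"], ["O", "O", "O"]]       # predictions

print("F1:       ", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
```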

## Additional information

### Contact Information

For further information, send an email to <[email protected]>

### Copyright

Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

### Licensing information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

## Citation Information

If you use our models, please cite our latest preprint:

```bibtex
```

### Contributions

[N/A]

### Disclaimer