PlanTL-GOB-ES
/

roberta-base-biomedical-es

@@ -14,7 +14,7 @@ widget:
 ---
 # Biomedical language model for Spanish
-Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es) and read our [preprint](https://arxiv.org/abs/2109.03570) "_Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario._".
 ## Tokenization and model pretraining
@@ -37,20 +37,20 @@ To obtain a high-quality training corpus, a cleaning pipeline with the following
   - deduplication of repetitive contents
   - keep the original document boundaries
-Finally, the corpora are concatenated and further global deduplication among the corpora have been applied.
 The result is a medium-size biomedical corpus for Spanish composed of about 963M tokens. The table below shows some basic statistics of the individual cleaned corpora:
 | Name                                                                                    | No. tokens  | Description                                                                                                                                                                                                                                          |
 |-----------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [Medical crawler](https://zenodo.org/record/4561970)                                    | 745,705,946 | Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains.                                                                                                                                                                                 |
 | Clinical cases misc.                                                                    | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is a scientific publication where medical practitioners share patient cases and it is different from a clinical note or document.                                                                                                                                                                                 |
 | [Scielo](https://github.com/PlanTL-GOB-ES/SciELO-Spain-Crawler)                        | 60,007,289  | Publications written in Spanish crawled from the Spanish SciELO server in 2017.                                                                                                                                       |
 | [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442  | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines.                                                                                       |
 | Wikipedia_life_sciences                                                                 | 13,890,501  | Wikipedia articles crawled 04/01/2021 with the [Wikipedia API python library](https://pypi.org/project/Wikipedia-API/) starting from the "Ciencias\_de\_la\_vida" category up to a maximum of 5 subcategories. Multiple links to the same articles are then discarded to avoid repeating content.                                                                                                                                                                    |
 | Patents                                                                                 | 13,463,387  | Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for Json files of patents are: "A61B", "A61C","A61F", "A61H", "A61K", "A61L","A61M", "A61B", "A61P".                                                        |
 | [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip)                  | 5,377,448   | Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency.                                                                                                                            |
-| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR)                        | 4,166,077   | Spanish-side articles extracted from a collection of Spanish-English parallel corpus consisting of biomedical scientific literature.  The collection of parallel resources are aggregated from the MedlinePlus source. |
 | PubMed                                                                                  | 1,858,966   | Open-access articles from the PubMed repository crawled in 2017.                                                                                                                                              |
@@ -81,35 +81,7 @@ The model is ready-to-use only for masked language modelling to perform the Fill
 However, the is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
 ## Cite
-If you use our models, please cite our latest preprint:
-```bibtex
-@misc{carrino2021biomedical,
-      title={Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario},
-      author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Asier Gutiérrez-Fandiño and Joan Llop-Palao and Marc Pàmies and Aitor Gonzalez-Agirre and Marta Villegas},
-      year={2021},
-      eprint={2109.03570},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL}
-}
-```
-If you use our Medical Crawler corpus, please cite the preprint:
-```bibtex
-@misc{carrino2021spanish,
-      title={Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models},
-      author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Ona de Gibert Bonet and Asier Gutiérrez-Fandiño and Aitor Gonzalez-Agirre and Martin Krallinger and Marta Villegas},
-      year={2021},
-      eprint={2109.07765},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL}
-}
-```
 ---

 ---
 # Biomedical language model for Spanish
+Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
 ## Tokenization and model pretraining
   - deduplication of repetitive contents
   - keep the original document boundaries
+Finally, the corpora are concatenated and further global deduplication among the corpora has been applied.
 The result is a medium-size biomedical corpus for Spanish composed of about 963M tokens. The table below shows some basic statistics of the individual cleaned corpora:
 | Name                                                                                    | No. tokens  | Description                                                                                                                                                                                                                                          |
 |-----------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [Medical crawler](https://zenodo.org/record/4561970)                                    | 903,558,13 | Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains.                                                                                                                                                                                 |
 | Clinical cases misc.                                                                    | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is a scientific publication where medical practitioners share patient cases and it is different from a clinical note or document.                                                                                                                                                                                 |
 | [Scielo](https://github.com/PlanTL-GOB-ES/SciELO-Spain-Crawler)                        | 60,007,289  | Publications written in Spanish crawled from the Spanish SciELO server in 2017.                                                                                                                                       |
 | [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442  | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines.                                                                                       |
 | Wikipedia_life_sciences                                                                 | 13,890,501  | Wikipedia articles crawled 04/01/2021 with the [Wikipedia API python library](https://pypi.org/project/Wikipedia-API/) starting from the "Ciencias\_de\_la\_vida" category up to a maximum of 5 subcategories. Multiple links to the same articles are then discarded to avoid repeating content.                                                                                                                                                                    |
 | Patents                                                                                 | 13,463,387  | Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for Json files of patents are: "A61B", "A61C","A61F", "A61H", "A61K", "A61L","A61M", "A61B", "A61P".                                                        |
 | [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip)                  | 5,377,448   | Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency.                                                                                                                            |
+| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR)                        | 4,166,077   | Spanish-side articles extracted from a collection of Spanish-English parallel corpus consisting of biomedical scientific literature.  The collection of parallel resources is aggregated from the MedlinePlus source. |
 | PubMed                                                                                  | 1,858,966   | Open-access articles from the PubMed repository crawled in 2017.                                                                                                                                              |
 However, the is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
 ## Cite
+To be announced soon.
 ---