Update README.md
README.md
CHANGED
@@ -14,7 +14,7 @@ widget:
|
|
14 |
---
|
15 |
|
16 |
# Biomedical language model for Spanish
|
17 |
-
Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es)
|
18 |
|
19 |
|
20 |
## Tokenization and model pretraining
|
@@ -37,20 +37,20 @@ To obtain a high-quality training corpus, a cleaning pipeline with the following
|
|
37 |
- deduplication of repetitive contents
|
38 |
- keep the original document boundaries
|
39 |
|
40 |
-
Finally, the corpora are concatenated and further global deduplication among the corpora
|
41 |
The result is a medium-size biomedical corpus for Spanish composed of about 963M tokens. The table below shows some basic statistics of the individual cleaned corpora:
|
42 |
|
43 |
|
44 |
| Name | No. tokens | Description |
|
45 |
|-----------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
46 |
-
| [Medical crawler](https://zenodo.org/record/4561970) |
|
47 |
| Clinical cases misc. | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is a scientific publication where medical practitioners share patient cases and it is different from a clinical note or document. |
|
48 |
| [Scielo](https://github.com/PlanTL-GOB-ES/SciELO-Spain-Crawler) | 60,007,289 | Publications written in Spanish crawled from the Spanish SciELO server in 2017. |
|
49 |
| [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines. |
|
50 |
| Wikipedia_life_sciences | 13,890,501 | Wikipedia articles crawled 04/01/2021 with the [Wikipedia API python library](https://pypi.org/project/Wikipedia-API/) starting from the "Ciencias\_de\_la\_vida" category up to a maximum of 5 subcategories. Multiple links to the same articles are then discarded to avoid repeating content. |
|
51 |
| Patents | 13,463,387 | Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for Json files of patents are: "A61B", "A61C","A61F", "A61H", "A61K", "A61L","A61M", "A61B", "A61P". |
|
52 |
| [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip) | 5,377,448 | Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency. |
|
53 |
-
| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side articles extracted from a collection of Spanish-English parallel corpus consisting of biomedical scientific literature. The collection of parallel resources
|
54 |
| PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |
|
55 |
|
56 |
|
@@ -81,35 +81,7 @@ The model is ready-to-use only for masked language modelling to perform the Fill
|
|
81 |
However, the model is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
|
82 |
|
83 |
## Cite
|
84 |
-
|
85 |
-
|
86 |
-
```bibtex
|
87 |
-
|
88 |
-
@misc{carrino2021biomedical,
|
89 |
-
title={Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario},
|
90 |
-
author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Asier Gutiérrez-Fandiño and Joan Llop-Palao and Marc Pàmies and Aitor Gonzalez-Agirre and Marta Villegas},
|
91 |
-
year={2021},
|
92 |
-
eprint={2109.03570},
|
93 |
-
archivePrefix={arXiv},
|
94 |
-
primaryClass={cs.CL}
|
95 |
-
}
|
96 |
-
|
97 |
-
```
|
98 |
-
|
99 |
-
If you use our Medical Crawler corpus, please cite the preprint:
|
100 |
-
|
101 |
-
```bibtex
|
102 |
-
|
103 |
-
@misc{carrino2021spanish,
|
104 |
-
title={Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models},
|
105 |
-
author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Ona de Gibert Bonet and Asier Gutiérrez-Fandiño and Aitor Gonzalez-Agirre and Martin Krallinger and Marta Villegas},
|
106 |
-
year={2021},
|
107 |
-
eprint={2109.07765},
|
108 |
-
archivePrefix={arXiv},
|
109 |
-
primaryClass={cs.CL}
|
110 |
-
}
|
111 |
-
|
112 |
-
```
|
113 |
|
114 |
---
|
115 |
|
|
|
14 |
---
|
15 |
|
16 |
# Biomedical language model for Spanish
|
17 |
+
Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
|
18 |
|
19 |
|
20 |
## Tokenization and model pretraining
|
|
|
37 |
- deduplication of repetitive contents
|
38 |
- keep the original document boundaries
|
39 |
|
40 |
+
Finally, the corpora are concatenated and a further global deduplication across the corpora is applied.
|
41 |
The result is a medium-size biomedical corpus for Spanish composed of about 963M tokens. The table below shows some basic statistics of the individual cleaned corpora:
|
42 |
|
43 |
|
44 |
| Name | No. tokens | Description |
|
45 |
|-----------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
46 |
+
| [Medical crawler](https://zenodo.org/record/4561970) | 903,558,13 | Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains. |
|
47 |
| Clinical cases misc. | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is a scientific publication where medical practitioners share patient cases and it is different from a clinical note or document. |
|
48 |
| [Scielo](https://github.com/PlanTL-GOB-ES/SciELO-Spain-Crawler) | 60,007,289 | Publications written in Spanish crawled from the Spanish SciELO server in 2017. |
|
49 |
| [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines. |
|
50 |
| Wikipedia_life_sciences | 13,890,501 | Wikipedia articles crawled 04/01/2021 with the [Wikipedia API python library](https://pypi.org/project/Wikipedia-API/) starting from the "Ciencias\_de\_la\_vida" category up to a maximum of 5 subcategories. Multiple links to the same articles are then discarded to avoid repeating content. |
|
51 |
| Patents | 13,463,387 | Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for Json files of patents are: "A61B", "A61C","A61F", "A61H", "A61K", "A61L","A61M", "A61B", "A61P". |
|
52 |
| [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip) | 5,377,448 | Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency. |
|
53 |
+
| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side articles extracted from a collection of Spanish-English parallel corpus consisting of biomedical scientific literature. The collection of parallel resources is aggregated from the MedlinePlus source. |
|
54 |
| PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |
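The concatenation and global deduplication step described before the table can be illustrated with a minimal sketch. The README does not say how duplicates are detected, so exact matching on a whitespace- and case-normalized hash is an assumption here, and the names `normalize`, `global_dedup`, and the sample corpora are hypothetical. Each list element stands for one document, so the original document boundaries are kept.

```python
import hashlib
from typing import Dict, List


def normalize(doc: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return " ".join(doc.split()).lower()


def global_dedup(corpora: Dict[str, List[str]]) -> List[str]:
    """Concatenate corpora in insertion order, dropping repeated documents.

    Assumption: two documents are duplicates when their normalized text is
    identical. The actual pipeline may use a different criterion.
    """
    seen = set()
    merged = []
    for docs in corpora.values():
        for doc in docs:
            key = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
            if key in seen:
                continue  # already present in this or an earlier corpus
            seen.add(key)
            merged.append(doc)  # one list element per document
    return merged


# Hypothetical toy corpora; the second one repeats a document from the first.
corpora = {
    "scielo": ["Caso clínico A.", "Estudio B."],
    "pubmed": ["caso  clínico a.", "Artículo C."],
}
merged = global_dedup(corpora)
```

The first occurrence of a document wins, so the order in which corpora are concatenated decides which copy survives.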
|
55 |
|
56 |
|
|
|
81 |
However, the model is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
|
82 |
|
83 |
## Cite
|
84 |
+
To be announced soon.
|
85 |
|
86 |
---
|
87 |
|