luisgasco committed on
Commit 4b9ac85
1 Parent(s): cdf5cca

Update README.md

Files changed (1)
  1. README.md +92 -1
README.md CHANGED

tags:
- embedding
- entity linking
- umls
---

# SapBERT-biomedical-clinical model for Spanish

## Table of contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)
  - [Author](#author)
  - [Licensing information](#licensing-information)
  - [Citation information](#citation-information)
  - [Disclaimer](#disclaimer)

</details>

## Model description
SapBERT model for Spanish, trained with a procedure similar to the one described by [Liu et al. (2020)](https://arxiv.org/pdf/2010.11784.pdf). The model was trained on the Spanish data from [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) 2023AA, using [PlanTL-GOB-ES/roberta-base-biomedical-clinical-es](https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es) as the base model.

## Intended uses and limitations
The model produces numerical representations (embeddings) of biomedical concepts in UMLS. These embeddings can be used, among other applications, for semantic similarity between biomedical concepts and for entity linking.

## How to use

The following script, taken and adapted from the [original SapBERT model](https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext/), converts a list of strings (entity names) into embeddings.

```python
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BSC-NLP4BIA/SapBERT-parents-from-roberta-base-biomedical-clinical-es")
model = AutoModel.from_pretrained("BSC-NLP4BIA/SapBERT-parents-from-roberta-base-biomedical-clinical-es").cuda()

# replace with your own list of entity names in Spanish
all_names = ["cancer de pulmón", "fiebre", "cirugía torácica"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {}
    for k, v in toks.items():
        toks_cuda[k] = v.cuda()
    cls_rep = model(**toks_cuda)[0][:, 0, :]  # use CLS representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())

all_embs = np.concatenate(all_embs, axis=0)
```

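As an illustration of the entity linking use case mentioned above, the embeddings can be compared by cosine similarity against a pre-embedded dictionary of UMLS terms. The sketch below is not part of the original card: `mention_embs`, `dict_embs` and `dict_cuis` are hypothetical placeholders that, in practice, you would build with the script above and your own UMLS term list.

```python
import numpy as np

# hypothetical inputs: in practice, build these with the embedding script above
rng = np.random.default_rng(0)
mention_embs = rng.normal(size=(3, 768)).astype("float32")    # embeddings of the mentions to link
dict_embs = rng.normal(size=(1000, 768)).astype("float32")    # embeddings of the UMLS dictionary terms
dict_cuis = [f"C{i:07d}" for i in range(1000)]                # CUIs aligned with the rows of dict_embs

# L2-normalise so that the dot product equals cosine similarity
mention_norm = mention_embs / np.linalg.norm(mention_embs, axis=1, keepdims=True)
dict_norm = dict_embs / np.linalg.norm(dict_embs, axis=1, keepdims=True)
sims = mention_norm @ dict_norm.T     # shape: (n_mentions, n_dict_entries)

best_idx = sims.argmax(axis=1)        # index of the most similar dictionary term per mention
predicted_cuis = [dict_cuis[i] for i in best_idx]
```
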
For more details about training and evaluation, see the SapBERT [GitHub repository](https://github.com/cambridgeltl/sapbert).

## Training
Training was performed with the [original SapBERT training repository](https://github.com/cambridgeltl/sapbert). The training data consists of the Spanish entries in UMLS, together with the commercial names of drugs (which are in English), all lowercased. To train the model, a set of 15 pairs of synonymous terms was generated for each UMLS concept, treating the lexical entries of each concept as synonyms (a rough sketch of this step is given below).
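
The following is a minimal sketch of that pair-generation step, not part of the original card: `concept_synonyms` is a hypothetical dictionary mapping CUIs to their lowercased Spanish terms, and the `CUI||term1||term2` output format follows the pairwise training files used by the SapBERT scripts (check the repository for the exact expected format).

```python
import itertools
import random

random.seed(0)

# hypothetical dictionary: CUI -> lowercased lexical entries (synonyms) of the concept
concept_synonyms = {
    "C0032285": ["neumonía", "pulmonía", "inflamación del pulmón"],
    "C0015967": ["fiebre", "pirexia", "hipertermia"],
}

MAX_PAIRS_PER_CONCEPT = 15
with open("training_pairs.txt", "w", encoding="utf-8") as f:
    for cui, names in concept_synonyms.items():
        # all unordered pairs of distinct synonyms for this concept
        pairs = list(itertools.combinations(sorted(set(names)), 2))
        random.shuffle(pairs)
        for term_a, term_b in pairs[:MAX_PAIRS_PER_CONCEPT]:
            f.write(f"{cui}||{term_a}||{term_b}\n")
```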

## Evaluation
To be published

## Additional information

### Author
NLP4BIA at the Barcelona Supercomputing Center

### Licensing information
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Citation information
To be published

### Disclaimer
<details>
<summary>Click to expand</summary>

The models published in this repository are intended for a general purpose and are available to third parties. These models may contain biases and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or systems based on these models), or become users of the models themselves, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including those regarding the use of Artificial Intelligence.

</details>