license: apache-2.0
language:
- es
pipeline_tag: feature-extraction
tags:
- bert
- biomedical
- lexical semantics
- bionlp
- embedding
- entity linking
- umls
SapBERT-biomedical-clinical model for Spanish
Model description
This is a Spanish SapBERT model trained with a procedure similar to the one described by Liu et al. (2020). It was trained on the Spanish data from UMLS 2023AA, using PlanTL-GOB-ES/roberta-base-biomedical-clinical-es as the base model.
Intended uses and limitations
The model produces numerical representations (embeddings) of biomedical concepts in UMLS. These embeddings can be used for tasks such as semantic similarity between biomedical concepts or entity linking, among others.
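As a rough illustration of the semantic similarity use case, the sketch below compares embedding vectors with cosine similarity. The small 4-dimensional vectors are made-up stand-ins for real model output, and `cosine_similarity_matrix` is a hypothetical helper name, not part of this model or of SapBERT.

```python
import numpy as np

def cosine_similarity_matrix(embs: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity between the rows of `embs`."""
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    unit = embs / np.clip(norms, 1e-12, None)  # guard against zero-length vectors
    return unit @ unit.T

# Toy stand-ins for model embeddings (assumed values, not real output).
toy_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
sims = cosine_similarity_matrix(toy_embs)
```

With real embeddings, each row of `embs` would be the vector the model produces for one entity name, and a higher value in `sims` would suggest a closer semantic match.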
How to use
The following script, adapted from the original SapBERT model, converts a list of strings (entity names) into embeddings. As written, it requires a CUDA-capable GPU.
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BSC-NLP4BIA/SapBERT-parents-from-roberta-base-biomedical-clinical-es")
model = AutoModel.from_pretrained("BSC-NLP4BIA/SapBERT-parents-from-roberta-base-biomedical-clinical-es").cuda()

# replace with your own list of entity names in Spanish
all_names = ["cancer de pulmón", "fiebre", "cirugía torácica"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    # move every input tensor to the GPU
    toks_cuda = {}
    for k, v in toks.items():
        toks_cuda[k] = v.cuda()
    with torch.no_grad():  # inference only, no gradients needed
        cls_rep = model(**toks_cuda)[0][:, 0, :]  # use CLS representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())

# stack the per-batch arrays into one (num_names, hidden_size) matrix
all_embs = np.concatenate(all_embs, axis=0)
For more details about training and evaluation, see the SapBERT GitHub repository.
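Once entity names have been embedded, entity linking can be approximated as a nearest-neighbour search over the embeddings of a concept dictionary. The sketch below is one possible way to do this with NumPy, not the procedure used by the authors. `link_mentions`, the `CONCEPT_*` identifiers, and the 2-dimensional vectors are all hypothetical placeholders.

```python
import numpy as np

def link_mentions(mention_embs, dict_embs, dict_ids, top_k=1):
    """For each mention, return the ids of the top_k most similar dictionary entries."""
    def normalize(x):
        return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

    # Cosine similarity between every mention and every dictionary entry.
    scores = normalize(mention_embs) @ normalize(dict_embs).T
    # Order dictionary indices by descending score and keep the first top_k.
    best = np.argsort(scores, axis=1)[:, ::-1][:, :top_k]
    return [[dict_ids[j] for j in row] for row in best]

# Toy data: placeholder ids and vectors (assumptions, not real UMLS entries).
dict_ids = ["CONCEPT_A", "CONCEPT_B", "CONCEPT_C"]
dict_embs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
mention_embs = np.array([[0.9, 0.2], [0.1, 0.8]])
links = link_mentions(mention_embs, dict_embs, dict_ids, top_k=2)
```

In practice, `dict_embs` would hold the embeddings of the dictionary terms and `mention_embs` those of the mentions to be linked, both computed with the same model. For large dictionaries, a dedicated similarity search index would likely be preferable to a dense matrix product.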
Training
The model was trained using the original SapBERT training repository. The training data were the Spanish entries in UMLS plus the commercial names of drugs (although these are in English), converted to lowercase. For each UMLS concept, a set of 15 pairs of synonymous terms was generated, treating the lexical entries of a concept as synonyms of one another.
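The pair-generation step described above could be sketched as follows. This is an illustration under stated assumptions, not the authors' actual script: it assumes that all names are lowercased, that duplicates are dropped, that pairs are sampled at random, and that 15 acts as an upper bound for concepts with enough lexical entries. `build_synonym_pairs` and the toy concept ids and names are hypothetical.

```python
import itertools
import random

def build_synonym_pairs(concept_names, max_pairs=15, seed=0):
    """Return (concept_id, name_a, name_b) tuples, at most max_pairs per concept."""
    rng = random.Random(seed)
    pairs = []
    for concept_id, names in concept_names.items():
        # Lowercase and deduplicate the lexical entries of the concept.
        unique = sorted({n.lower() for n in names})
        # Every unordered pair of distinct names counts as a synonym pair.
        combos = list(itertools.combinations(unique, 2))
        if len(combos) > max_pairs:
            combos = rng.sample(combos, max_pairs)  # assumed random subsampling
        pairs.extend((concept_id, a, b) for a, b in combos)
    return pairs

# Toy input with placeholder ids and names (assumptions, not UMLS data).
toy = {
    "CONCEPT_A": ["Term One", "term two", "TERM THREE"],
    "CONCEPT_B": ["single term"],
}
pairs = build_synonym_pairs(toy)
```

A concept with a single lexical entry yields no pairs under this scheme, since there is nothing to pair it with.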
Evaluation
Evaluation results for this model are reported in: Gallego, F., López-García, G., Gasco-Sánchez, L., Krallinger, M., & Veredas, F. J. (2024, June). ClinLinker: Medical entity linking of clinical concept mentions in Spanish. In International Conference on Computational Science (pp. 266-280). Cham: Springer Nature Switzerland.
Additional information
Author
NLP4BIA at the Barcelona Supercomputing Center
Licensing information
Apache License, Version 2.0
Citation information
@inproceedings{gallego2024clinlinker,
  title={Clinlinker: Medical entity linking of clinical concept mentions in spanish},
  author={Gallego, Fernando and L{\'o}pez-Garc{\'\i}a, Guillermo and Gasco-S{\'a}nchez, Luis and Krallinger, Martin and Veredas, Francisco J},
  booktitle={International Conference on Computational Science},
  pages={266--280},
  year={2024},
  organization={Springer}
}
Disclaimer
The models published in this repository are intended for general-purpose use and are available to third parties. These models may contain biases and/or other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using any of these models (or systems based on these models), or become users of the models themselves, it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.