luisgasco committed on
Commit 4b9ac85
1 Parent(s): cdf5cca

Update README.md

Files changed (1)
  1. README.md +92 -1
README.md CHANGED

tags:
- embedding
- entity linking
- umls
---

# SapBERT-biomedical-clinical model for Spanish

## Table of contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)
  - [Author](#author)
  - [Licensing information](#licensing-information)
  - [Citation information](#citation-information)
  - [Disclaimer](#disclaimer)

</details>

## Model description
SapBERT model for Spanish, trained with a procedure similar to the one described by [Liu et al. (2020)](https://arxiv.org/pdf/2010.11784.pdf). The model was trained on the Spanish data from [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) 2023AA, using [PlanTL-GOB-ES/roberta-base-biomedical-clinical-es](https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es) as the base model.

## Intended uses and limitations
The model produces numerical representations (embeddings) of biomedical concepts in UMLS. These embeddings can be used, among other applications, for semantic similarity between biomedical concepts and for entity linking.

## How to use

The following script, taken and adapted from the [original SapBERT model](https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext/), converts a list of strings (entity names) into embeddings.

```python
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BSC-NLP4BIA/SapBERT-parents-from-roberta-base-biomedical-clinical-es")
model = AutoModel.from_pretrained("BSC-NLP4BIA/SapBERT-parents-from-roberta-base-biomedical-clinical-es").cuda()

# replace with your own list of entity names in Spanish
all_names = ["cancer de pulmón", "fiebre", "cirugía torácica"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {}
    for k, v in toks.items():
        toks_cuda[k] = v.cuda()
    cls_rep = model(**toks_cuda)[0][:, 0, :]  # use CLS representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())

all_embs = np.concatenate(all_embs, axis=0)
```

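As an illustration of the entity linking use case mentioned above, the embeddings can be compared by cosine similarity against a pre-embedded dictionary of UMLS terms. The sketch below is not part of the original card: `mention_embs`, `dict_embs` and `dict_cuis` are hypothetical placeholders that, in practice, you would build with the script above and your own UMLS term list.

```python
import numpy as np

# hypothetical inputs: in practice, build these with the embedding script above
rng = np.random.default_rng(0)
mention_embs = rng.normal(size=(3, 768)).astype("float32")    # embeddings of the mentions to link
dict_embs = rng.normal(size=(1000, 768)).astype("float32")    # embeddings of the UMLS dictionary terms
dict_cuis = [f"C{i:07d}" for i in range(1000)]                # CUIs aligned with the rows of dict_embs

# L2-normalise so that the dot product equals cosine similarity
mention_norm = mention_embs / np.linalg.norm(mention_embs, axis=1, keepdims=True)
dict_norm = dict_embs / np.linalg.norm(dict_embs, axis=1, keepdims=True)
sims = mention_norm @ dict_norm.T     # shape: (n_mentions, n_dict_entries)

best_idx = sims.argmax(axis=1)        # index of the most similar dictionary term per mention
predicted_cuis = [dict_cuis[i] for i in best_idx]
```
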
For more details about training and evaluation, see the SapBERT [GitHub repository](https://github.com/cambridgeltl/sapbert).

## Training
Training was performed with the [original SapBERT training repository](https://github.com/cambridgeltl/sapbert). The training data consists of the Spanish entries in UMLS, together with the commercial names of drugs (which are in English), all lowercased. To train the model, a set of 15 pairs of synonymous terms was generated for each UMLS concept, treating the lexical entries of each concept as synonyms (a rough sketch of this step is given below).
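
The following is a minimal sketch of that pair-generation step, not part of the original card: `concept_synonyms` is a hypothetical dictionary mapping CUIs to their lowercased Spanish terms, and the `CUI||term1||term2` output format follows the pairwise training files used by the SapBERT scripts (check the repository for the exact expected format).

```python
import itertools
import random

random.seed(0)

# hypothetical dictionary: CUI -> lowercased lexical entries (synonyms) of the concept
concept_synonyms = {
    "C0032285": ["neumonía", "pulmonía", "inflamación del pulmón"],
    "C0015967": ["fiebre", "pirexia", "hipertermia"],
}

MAX_PAIRS_PER_CONCEPT = 15
with open("training_pairs.txt", "w", encoding="utf-8") as f:
    for cui, names in concept_synonyms.items():
        # all unordered pairs of distinct synonyms for this concept
        pairs = list(itertools.combinations(sorted(set(names)), 2))
        random.shuffle(pairs)
        for term_a, term_b in pairs[:MAX_PAIRS_PER_CONCEPT]:
            f.write(f"{cui}||{term_a}||{term_b}\n")
```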

## Evaluation
To be published

## Additional information

### Author
NLP4BIA at the Barcelona Supercomputing Center

### Licensing information
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Citation information
To be published

### Disclaimer
<details>
<summary>Click to expand</summary>

The models published in this repository are intended for a general purpose and are available to third parties. These models may contain biases and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or systems based on these models), or become users of the models themselves, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including those regarding the use of Artificial Intelligence.

</details>