---
id: mirrorbert_MedRoBERTa.nl_meantoken
name: mirrorbert_MedRoBERTa.nl_meantoken
description: MedRoBERTa.nl continued pre-training on hard medical term pairs from the UMLS/SNOMED ontology, using the InfoNCE loss function, as implemented in MirrorBERT
license: gpl-3.0
language: nl
tags:
- biology
- embedding
- entity linking
- biomedical
- science
- bionlp
- lexical semantic
pipeline_tag: feature-extraction
---

# Model Card for mirrorbert_MedRoBERTa.nl_meantoken

The model was trained on medical entity triplets (anchor, term, synonym), using the InfoNCE contrastive loss as implemented in MirrorBERT (a sketch of this objective is given at the end of this card).

### Expected input and output

The input should be a string of biomedical entity names, e.g., "covid infection" or "Hydroxychloroquine". The mean of the last layer's token embeddings is taken as the output (mean-token pooling, matching the `meantoken` suffix in the model name).

#### Extracting embeddings from mirrorbert_MedRoBERTa.nl_meantoken

The following script converts a list of strings (entity names) into embeddings.

```python
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("UMCU/mirrorbert_MedRoBERTa.nl_meantoken")
model = AutoModel.from_pretrained("UMCU/mirrorbert_MedRoBERTa.nl_meantoken").cuda()

# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever",
             "Tumor of posterior wall of oropharynx"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    # tokenize one batch, padded/truncated to a fixed length of 25 tokens
    toks = tokenizer.batch_encode_plus(all_names[i:i + bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}
    with torch.no_grad():
        # mean-token pooling over the last hidden state
        # (note: averages all 25 positions, padding included)
        mean_rep = model(**toks_cuda)[0].mean(1)
    all_embs.append(mean_rep.cpu().numpy())

all_embs = np.concatenate(all_embs, axis=0)
```

A sketch of a nearest-neighbour lookup over these embeddings is given at the end of this card.

# Data description

Hard Dutch UMLS/SNOMED synonym pairs (terms referring to the same CUI/SCUI), including English medication names.

# Acknowledgement

This is part of the [DT4H project](https://www.datatools4heart.eu/).

# DOI and reference

For more details about training and evaluation, see the MirrorBERT [GitHub repo](https://github.com/cambridgeltl/mirror-bert).

### Citation

```bibtex
@inproceedings{liu-etal-2021-fast,
    title = "Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders",
    author = "Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.109",
    pages = "1442--1459",
}
```

For more details about training/evaluation and other scripts, see the CardioNER [GitHub repo](https://github.com/DataTools4Heart/CardioNER), and for more information on the background, see the DataTools4Heart [Hugging Face page](https://huggingface.co./DT4H) / [website](https://www.datatools4heart.eu/).
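# Sketch: nearest-neighbour entity lookup

Since the model is tagged for entity linking, one natural use of the embeddings is a cosine-similarity nearest-neighbour lookup. The following is a minimal illustrative sketch, not part of the released code; it reuses `all_names` and `all_embs` from the extraction script above.

```python
import numpy as np

# L2-normalise so that the dot product equals cosine similarity
norm_embs = all_embs / np.linalg.norm(all_embs, axis=1, keepdims=True)

# pairwise cosine similarities between all entity names
sims = norm_embs @ norm_embs.T
np.fill_diagonal(sims, -1.0)  # exclude trivial self-matches

for name, j in zip(all_names, sims.argmax(axis=1)):
    print(f"{name!r} -> nearest neighbour: {all_names[j]!r}")
```

In practice one would embed the ontology's terms once, then link each new mention to the term with the highest cosine similarity.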
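# Sketch: the InfoNCE objective

For background, this is a minimal sketch of a standard in-batch InfoNCE loss over (term, synonym) pairs, shown in its common pairwise form. The function name, the temperature value, and the pairing details are illustrative assumptions; the exact triplet construction and loss configuration used to train this model follow the MirrorBERT implementation linked below and may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(term_embs: torch.Tensor, syn_embs: torch.Tensor,
             temperature: float = 0.04) -> torch.Tensor:
    """In-batch InfoNCE: row i of each tensor is one synonym pair.

    Each term's matched synonym is its positive; all other synonyms
    in the batch act as negatives.
    """
    a = F.normalize(term_embs, dim=-1)
    b = F.normalize(syn_embs, dim=-1)
    logits = a @ b.t() / temperature                   # scaled cosine similarities
    labels = torch.arange(a.size(0), device=a.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```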