---
language:
- en
tags:
- Machine Learning
- Research Papers
- Scientific Language Model
- Entity
license: apache-2.0
---

## MLEntityRoBERTa

## How to use:
```
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained('shrutisingh/MLEntityRoBERTa')
model = AutoModel.from_pretrained('shrutisingh/MLEntityRoBERTa')
```

## Pretraining Details:
This is a variant of the [MLRoBERTa model](https://huggingface.co./shrutisingh/MLRoBERTa/blob/main/README.md) trained on an entity-masked dataset. The MLRoBERTa pretraining dataset is modified to replace specific scientific entities in a paper with generic labels. The idea is to make the model focus on the syntax and semantics of the text without being distracted by specific entity names. Scientific entities belonging to any of the TDMM classes (task, dataset, method, metric) are replaced with the corresponding class label. The entity set is manually cleaned and mapped to appropriate labels.

E.g.: "The authors present results on MNIST." -> "The authors present results on dataset."

## Citation:
```
@inproceedings{singh2021compare,
  title={COMPARE: a taxonomy and dataset of comparison discussions in peer reviews},
  author={Singh, Shruti and Singh, Mayank and Goyal, Pawan},
  booktitle={2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
  pages={238--241},
  year={2021},
  organization={IEEE}
}
```
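As a toy illustration of the entity-masking step described in the pretraining details, the snippet below replaces known TDMM entity mentions with their class labels. This is a minimal sketch, not the authors' actual preprocessing code; the `ENTITY_TO_LABEL` mapping and the `mask_entities` helper are assumptions for illustration only.

```
# Hypothetical sketch of TDMM entity masking; the mapping below is illustrative,
# not the manually cleaned entity set used for pretraining.
ENTITY_TO_LABEL = {
    "question answering": "task",
    "MNIST": "dataset",
    "BERT": "method",
    "accuracy": "metric",
}

def mask_entities(text, entity_map):
    # Replace longer entity names first so shorter entities that are
    # substrings of longer ones are not masked prematurely.
    for entity in sorted(entity_map, key=len, reverse=True):
        text = text.replace(entity, entity_map[entity])
    return text

print(mask_entities("The authors present results on MNIST.", ENTITY_TO_LABEL))
# -> The authors present results on dataset.
```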