---
library_name: transformers
tags: [physics, NLP, embedding, sentence-transformer]
---

# Model Card for PhysBERT

PhysBERT is a specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature. Trained on 1.2 million physics papers, it outperforms general-purpose models on physics-specific tasks.

## Model Description

PhysBERT is a BERT-based text embedding model for physics, fine-tuned with SimCSE for physics-specific performance. It enables efficient retrieval, categorization, and analysis of physics literature, achieving higher relevance and accuracy on domain-specific NLP tasks. The uncased version can be found [here](https://huggingface.co./thellert/physbert_uncased).

- **Developed by:** Thorsten Hellert, João Montenegro, Andrea Pollastro
- **Funded by:** US Department of Energy, Lawrence Berkeley National Laboratory
- **Model type:** Text embedding model (BERT-based)
- **Language(s) (NLP):** English
- **Paper:** [PhysBERT: A Text Embedding Model for Physics Scientific Literature](https://doi.org/10.1063/5.0238090)

## Training Data

The model was trained on a 40 GB corpus of 1.2 million documents drawn from arXiv's physics publications and refined for scientific accuracy.

## Training Procedure

The model was pre-trained with Masked Language Modeling (MLM) and fine-tuned with SimCSE to produce sentence embeddings.

## Example of Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the PhysBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_cased")
model = AutoModel.from_pretrained("thellert/physbert_cased")

# Sample text to embed
sample_text = "Electrons exhibit both particle and wave-like behavior."

# Tokenize the input text and run it through the model without tracking gradients
inputs = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Extract the per-token embeddings
token_embeddings = outputs.last_hidden_state

# Drop the [CLS] and [SEP] tokens, then mean-pool to get a sentence embedding
token_embeddings = token_embeddings[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```

An example of comparing two such embeddings appears at the end of this card.

## Citation

If you find this work useful, please consider citing the following paper:

```
@article{10.1063/5.0238090,
  author  = {Hellert, Thorsten and Montenegro, João and Pollastro, Andrea},
  title   = {{PhysBERT: A text embedding model for physics scientific literature}},
  journal = {APL Machine Learning},
  volume  = {2},
  number  = {4},
  pages   = {046105},
  year    = {2024},
  month   = {10},
  issn    = {2770-9019},
  doi     = {10.1063/5.0238090},
  url     = {https://doi.org/10.1063/5.0238090},
  eprint  = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0238090/20227307/046105_1_5.0238090.pdf},
}
```

## Model Card Authors

Thorsten Hellert, João Montenegro, Andrea Pollastro

## Model Card Contact

Thorsten Hellert, Lawrence Berkeley National Laboratory, thellert@lbl.gov
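
## Appendix: Comparing Sentence Embeddings

As a minimal sketch of how PhysBERT embeddings can be used downstream, the snippet below compares two sentences by cosine similarity using the same [CLS]/[SEP]-dropping mean-pooling recipe as the usage example above. The `embed` helper and the sentence pair are illustrative choices, not part of the released model API.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_cased")
model = AutoModel.from_pretrained("thellert/physbert_cased")

def embed(text: str) -> torch.Tensor:
    # Illustrative helper: tokenize, encode, and mean-pool as in the usage example
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Drop [CLS] and [SEP] before mean-pooling (assumes a single, unpadded sentence)
    return outputs.last_hidden_state[:, 1:-1, :].mean(dim=1)

# Two physics statements chosen purely for illustration
emb_a = embed("Electrons exhibit both particle and wave-like behavior.")
emb_b = embed("Wave-particle duality is a central concept of quantum mechanics.")

# Cosine similarity: values closer to 1 indicate more similar sentences
similarity = F.cosine_similarity(emb_a, emb_b).item()
print(f"Cosine similarity: {similarity:.3f}")
```

Cosine similarity over mean-pooled embeddings is a common retrieval baseline. Note that for batched, padded inputs you would mask out padding tokens before pooling rather than simply slicing off the first and last positions.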