---
library_name: transformers
tags: [physics, NLP, embedding, sentence-transformer]
---

# Model Card for PhysBERT

PhysBERT is a specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature. Trained on 1.2 million physics papers, it outperforms general-purpose models on physics-specific tasks.

## Model Description

PhysBERT is a BERT-based text embedding model for physics, fine-tuned with SimCSE for physics-specific performance. It enables efficient retrieval, categorization, and analysis of physics literature, achieving higher relevance and accuracy on domain-specific NLP tasks. The uncased version can be found [here](https://huggingface.co./thellert/physbert_uncased).

- **Developed by:** Thorsten Hellert, João Montenegro, Andrea Pollastro
- **Funded by:** US Department of Energy, Lawrence Berkeley National Laboratory
- **Model type:** Text embedding model (BERT-based)
- **Language(s) (NLP):** English
- **Paper:** [PhysBERT: A Text Embedding Model for Physics Scientific Literature](https://doi.org/10.1063/5.0238090)

## Training Data

The model was trained on a 40 GB corpus of 1.2 million documents drawn from arXiv's physics publications and refined for scientific accuracy.

## Training Procedure

The model was pre-trained with Masked Language Modeling (MLM) and fine-tuned with SimCSE to produce sentence embeddings.

## Example of Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the PhysBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_cased")
model = AutoModel.from_pretrained("thellert/physbert_cased")

# Sample text to embed
sample_text = "Electrons exhibit both particle and wave-like behavior."

# Tokenize the input text and run it through the model without tracking gradients
inputs = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Extract the per-token embeddings
token_embeddings = outputs.last_hidden_state

# Drop the [CLS] and [SEP] tokens, then mean-pool to get a sentence embedding
token_embeddings = token_embeddings[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```

An example of comparing two such embeddings appears at the end of this card.

## Citation

If you find this work useful, please consider citing the following paper:

```
@article{10.1063/5.0238090,
  author  = {Hellert, Thorsten and Montenegro, João and Pollastro, Andrea},
  title   = {{PhysBERT: A text embedding model for physics scientific literature}},
  journal = {APL Machine Learning},
  volume  = {2},
  number  = {4},
  pages   = {046105},
  year    = {2024},
  month   = {10},
  issn    = {2770-9019},
  doi     = {10.1063/5.0238090},
  url     = {https://doi.org/10.1063/5.0238090},
  eprint  = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0238090/20227307/046105_1_5.0238090.pdf},
}
```

## Model Card Authors

Thorsten Hellert, João Montenegro, Andrea Pollastro

## Model Card Contact

Thorsten Hellert, Lawrence Berkeley National Laboratory, thellert@lbl.gov
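
## Appendix: Comparing Sentence Embeddings

As a minimal sketch of how PhysBERT embeddings can be used downstream, the snippet below compares two sentences by cosine similarity using the same [CLS]/[SEP]-dropping mean-pooling recipe as the usage example above. The `embed` helper and the sentence pair are illustrative choices, not part of the released model API.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_cased")
model = AutoModel.from_pretrained("thellert/physbert_cased")

def embed(text: str) -> torch.Tensor:
    # Illustrative helper: tokenize, encode, and mean-pool as in the usage example
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Drop [CLS] and [SEP] before mean-pooling (assumes a single, unpadded sentence)
    return outputs.last_hidden_state[:, 1:-1, :].mean(dim=1)

# Two physics statements chosen purely for illustration
emb_a = embed("Electrons exhibit both particle and wave-like behavior.")
emb_b = embed("Wave-particle duality is a central concept of quantum mechanics.")

# Cosine similarity: values closer to 1 indicate more similar sentences
similarity = F.cosine_similarity(emb_a, emb_b).item()
print(f"Cosine similarity: {similarity:.3f}")
```

Cosine similarity over mean-pooled embeddings is a common retrieval baseline. Note that for batched, padded inputs you would mask out padding tokens before pooling rather than simply slicing off the first and last positions.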