---
library_name: transformers
tags: [physics, NLP, embedding, sentence-transformer]
---
# Model Card for PhysBERT
PhysBERT is a specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature. Trained on 1.2 million physics papers, it outperforms general-purpose embedding models on physics-specific tasks.
## Model Description
PhysBERT is a BERT-based text embedding model for physics, fine-tuned with SimCSE to produce sentence embeddings tailored to physics text. It enables efficient retrieval, categorization, and analysis of physics literature, achieving higher relevance and accuracy on domain-specific NLP tasks. The uncased version is available [here](https://huggingface.co./thellert/physbert_uncased).
- **Developed by:** Thorsten Hellert, João Montenegro, Andrea Pollastro
- **Funded by:** US Department of Energy, Lawrence Berkeley National Laboratory
- **Model type:** Text embedding model (BERT-based)
- **Language(s) (NLP):** English
- **Paper:** [PhysBERT: A Text Embedding Model for Physics Scientific Literature](https://doi.org/10.1063/5.0238090)
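Since the model is tagged for sentence-transformers use, it can also be loaded through that library. Below is a minimal sketch, assuming the library's standard fallback behavior (when a repository ships no dedicated sentence-transformers config, it wraps the transformer with a mean-pooling head); the example sentences are illustrative:

```python
from sentence_transformers import SentenceTransformer

# If no sentence-transformers config is present in the repo, the library
# falls back to a plain Transformer module followed by mean pooling.
model = SentenceTransformer("thellert/physbert_cased")

embeddings = model.encode([
    "Electrons exhibit both particle and wave-like behavior.",
    "Superconductivity vanishes above the critical temperature.",
])
print(embeddings.shape)  # (2, hidden_size)
```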
## Training Data
The model was trained on a 40 GB corpus of 1.2 million physics publications from arXiv, curated for scientific accuracy.
## Training Procedure
The model was first pre-trained on this corpus with a Masked Language Modeling (MLM) objective and then fine-tuned with SimCSE to produce sentence embeddings.
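For readers unfamiliar with SimCSE: in its unsupervised form, each sentence is encoded twice so that dropout yields two slightly different embeddings of the same text, and a contrastive (InfoNCE) loss pulls these pairs together while pushing apart embeddings of different sentences. Below is a minimal sketch of that loss; the helper name and temperature value are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def simcse_loss(emb_a, emb_b, temperature=0.05):
    """Unsupervised SimCSE (InfoNCE) loss over a batch of sentence embeddings.

    emb_a and emb_b embed the SAME sentences, encoded twice so that
    dropout produces two slightly different "views" of each one.
    """
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    sim = emb_a @ emb_b.T / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)  # positives on the diagonal
    return F.cross_entropy(sim, labels)
```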
## Example of Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the PhysBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_cased")
model = AutoModel.from_pretrained("thellert/physbert_cased")
model.eval()

# Sample text to embed
sample_text = "Electrons exhibit both particle and wave-like behavior."

# Tokenize the input text and run it through the model
inputs = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Extract the token embeddings
token_embeddings = outputs.last_hidden_state

# Drop the [CLS] and [SEP] tokens, then average the remaining
# token embeddings to obtain a single sentence embedding
token_embeddings = token_embeddings[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
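The snippet above embeds a single sentence, so simply slicing off the first and last tokens works. For padded batches, a mask-aware mean avoids averaging over padding tokens. The sketch below reuses the `tokenizer` and `model` objects from the example; the `mean_pool` helper is illustrative (it averages over all non-padding tokens, including [CLS] and [SEP]) and is not part of the model card:

```python
def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over the real tokens only.
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

texts = [
    "Electrons exhibit both particle and wave-like behavior.",
    "The Higgs boson was observed at the LHC in 2012.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
embeddings = mean_pool(out.last_hidden_state, batch["attention_mask"])
```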
## Citation
If you find this work useful, please consider citing the following paper:
```bibtex
@article{10.1063/5.0238090,
  author  = {Hellert, Thorsten and Montenegro, João and Pollastro, Andrea},
  title   = {{PhysBERT: A text embedding model for physics scientific literature}},
  journal = {APL Machine Learning},
  volume  = {2},
  number  = {4},
  pages   = {046105},
  year    = {2024},
  month   = {10},
  issn    = {2770-9019},
  doi     = {10.1063/5.0238090},
  url     = {https://doi.org/10.1063/5.0238090},
  eprint  = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0238090/20227307/046105_1_5.0238090.pdf},
}
```
## Model Card Authors
Thorsten Hellert, João Montenegro, Andrea Pollastro
## Model Card Contact
Thorsten Hellert, Lawrence Berkeley National Laboratory, [email protected]