---
library_name: transformers
tags: [physics, NLP, embedding, sentence-transformer]
---

# Model Card for PhysBERT

PhysBERT is a specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature. Trained on 1.2 million physics papers, it outperforms general-purpose models in physics-specific tasks.

## Model Description

PhysBERT is a BERT-based text embedding model for physics, fine-tuned with SimCSE to optimize physics-specific performance. The model enables efficient retrieval, categorization, and analysis of physics literature, achieving higher relevance and accuracy on domain-specific NLP tasks than general-purpose embedding models. The uncased version is available [here](https://huggingface.co./thellert/physbert_uncased).

- **Developed by:** Thorsten Hellert, João Montenegro, Andrea Pollastro
- **Funded by:** US Department of Energy, Lawrence Berkeley National Laboratory
- **Model type:** Text embedding model (BERT-based)
- **Language(s) (NLP):** English
- **Paper:** [PhysBERT: A Text Embedding Model for Physics Scientific Literature](https://doi.org/10.1063/5.0238090)
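
Given the sentence-transformer tag, the model can presumably also be loaded through the sentence-transformers library; a minimal sketch under that assumption (the example sentences are made up, and if the repository ships no pooling configuration the library falls back to plain mean pooling over all tokens):

```python
from sentence_transformers import SentenceTransformer

# Load PhysBERT via sentence-transformers; without a pooling config in the
# repository, the library constructs a default mean-pooling head.
model = SentenceTransformer("thellert/physbert_cased")

embeddings = model.encode([
    "Electrons exhibit both particle and wave-like behavior.",
    "Superconductors expel magnetic fields below a critical temperature.",
])
print(embeddings.shape)  # (2, hidden_size), e.g. (2, 768) for a BERT-base encoder
```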

## Training Data

PhysBERT was trained on a 40 GB corpus of 1.2 million physics publications from arXiv, curated for scientific accuracy.

## Training Procedure

The model was first pre-trained with a Masked Language Modeling (MLM) objective and then fine-tuned with SimCSE to produce sentence embeddings.
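
For intuition, below is a minimal sketch of a SimCSE-style contrastive objective (illustrative only; the temperature and the use of in-batch negatives are generic SimCSE defaults, not necessarily the exact settings used to train PhysBERT):

```python
import torch
import torch.nn.functional as F

def simcse_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss over in-batch negatives, as in SimCSE.

    emb_a and emb_b are (batch, dim) embeddings of the same sentences
    produced under two independent dropout masks (unsupervised SimCSE).
    """
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    # Cosine similarity of every pair in the batch, scaled by temperature
    sim = emb_a @ emb_b.T / temperature
    # The positive for row i sits on the diagonal at column i
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```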

## Example of Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the PhysBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_cased")
model = AutoModel.from_pretrained("thellert/physbert_cased")

# Sample text to embed
sample_text = "Electrons exhibit both particle and wave-like behavior."

# Tokenize the input text and run it through the model
inputs = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Extract the token embeddings
token_embeddings = outputs.last_hidden_state

# Drop the [CLS] and [SEP] tokens, then average the remaining token
# embeddings into a sentence embedding (valid here because the batch
# holds a single, unpadded sequence)
token_embeddings = token_embeddings[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
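
The resulting sentence embeddings can then be compared with cosine similarity, e.g. for retrieval or clustering. A short continuation of the snippet above (the `embed` helper and the query text are illustrative, not part of an official API):

```python
import torch.nn.functional as F

def embed(text: str) -> torch.Tensor:
    """Mean-pooled sentence embedding, mirroring the snippet above."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 1:-1, :].mean(dim=1)

query_emb = embed("wave-particle duality of electrons")
doc_emb = embed("Electrons exhibit both particle and wave-like behavior.")
print(F.cosine_similarity(query_emb, doc_emb).item())  # cosine score in [-1, 1]
```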

## Citation

If you find this work useful, please consider citing the following paper:

```bibtex
@article{10.1063/5.0238090,
    author = {Hellert, Thorsten and Montenegro, João and Pollastro, Andrea},
    title = {{PhysBERT: A text embedding model for physics scientific literature}},
    journal = {APL Machine Learning},
    volume = {2},
    number = {4},
    pages = {046105},
    year = {2024},
    month = {10},
    issn = {2770-9019},
    doi = {10.1063/5.0238090},
    url = {https://doi.org/10.1063/5.0238090},
    eprint = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0238090/20227307/046105\_1\_5.0238090.pdf},
}
```

## Model Card Authors

Thorsten Hellert, João Montenegro, Andrea Pollastro

## Model Card Contact

Thorsten Hellert, Lawrence Berkeley National Laboratory, [email protected]