---
library_name: transformers
tags: [physics, NLP, embedding, sentence-transformer]
---
# Model Card for PhysBERT
PhysBERT is a specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature. Trained on 1.2 million physics papers, it outperforms general-purpose embedding models on physics-specific tasks.
## Model Description
PhysBERT is a BERT-based text embedding model for physics, fine-tuned with SimCSE to produce sentence embeddings tailored to physics text. It enables efficient retrieval, categorization, and analysis of physics literature, achieving higher relevance and accuracy on domain-specific NLP tasks. The uncased version is available [here](https://huggingface.co./thellert/physbert_uncased).
- **Developed by:** Thorsten Hellert, João Montenegro, Andrea Pollastro
- **Funded by:** US Department of Energy, Lawrence Berkeley National Laboratory
- **Model type:** Text embedding model (BERT-based)
- **Language(s) (NLP):** English
- **Paper:** [PhysBERT: A Text Embedding Model for Physics Scientific Literature](https://doi.org/10.1063/5.0238090)
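Since the model is tagged for sentence-transformers use, it can also be loaded through that library. Below is a minimal sketch, assuming the library's standard fallback behavior (when a repository ships no dedicated sentence-transformers config, it wraps the transformer with a mean-pooling head); the example sentences are illustrative:

```python
from sentence_transformers import SentenceTransformer

# If no sentence-transformers config is present in the repo, the library
# falls back to a plain Transformer module followed by mean pooling.
model = SentenceTransformer("thellert/physbert_cased")

embeddings = model.encode([
    "Electrons exhibit both particle and wave-like behavior.",
    "Superconductivity vanishes above the critical temperature.",
])
print(embeddings.shape)  # (2, hidden_size)
```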
## Training Data
The model was trained on a 40 GB corpus of 1.2 million physics publications from arXiv, curated for scientific accuracy.
## Training Procedure
The model was first pre-trained on this corpus with a Masked Language Modeling (MLM) objective and then fine-tuned with SimCSE to produce sentence embeddings.
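For readers unfamiliar with SimCSE: in its unsupervised form, each sentence is encoded twice so that dropout yields two slightly different embeddings of the same text, and a contrastive (InfoNCE) loss pulls these pairs together while pushing apart embeddings of different sentences. Below is a minimal sketch of that loss; the helper name and temperature value are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def simcse_loss(emb_a, emb_b, temperature=0.05):
    """Unsupervised SimCSE (InfoNCE) loss over a batch of sentence embeddings.

    emb_a and emb_b embed the SAME sentences, encoded twice so that
    dropout produces two slightly different "views" of each one.
    """
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    sim = emb_a @ emb_b.T / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)  # positives on the diagonal
    return F.cross_entropy(sim, labels)
```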
## Example of Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the PhysBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_cased")
model = AutoModel.from_pretrained("thellert/physbert_cased")
model.eval()

# Sample text to embed
sample_text = "Electrons exhibit both particle and wave-like behavior."

# Tokenize the input text and run it through the model
inputs = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Extract the token embeddings
token_embeddings = outputs.last_hidden_state

# Drop the [CLS] and [SEP] tokens, then average the remaining
# token embeddings to obtain a single sentence embedding
token_embeddings = token_embeddings[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
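The snippet above embeds a single sentence, so simply slicing off the first and last tokens works. For padded batches, a mask-aware mean avoids averaging over padding tokens. The sketch below reuses the `tokenizer` and `model` objects from the example; the `mean_pool` helper is illustrative (it averages over all non-padding tokens, including [CLS] and [SEP]) and is not part of the model card:

```python
def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over the real tokens only.
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

texts = [
    "Electrons exhibit both particle and wave-like behavior.",
    "The Higgs boson was observed at the LHC in 2012.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
embeddings = mean_pool(out.last_hidden_state, batch["attention_mask"])
```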
## Citation
If you find this work useful, please consider citing the following paper:
```bibtex
@article{10.1063/5.0238090,
  author  = {Hellert, Thorsten and Montenegro, João and Pollastro, Andrea},
  title   = {{PhysBERT: A text embedding model for physics scientific literature}},
  journal = {APL Machine Learning},
  volume  = {2},
  number  = {4},
  pages   = {046105},
  year    = {2024},
  month   = {10},
  issn    = {2770-9019},
  doi     = {10.1063/5.0238090},
  url     = {https://doi.org/10.1063/5.0238090},
  eprint  = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0238090/20227307/046105_1_5.0238090.pdf},
}
```
## Model Card Authors
Thorsten Hellert, João Montenegro, Andrea Pollastro
## Model Card Contact
Thorsten Hellert, Lawrence Berkeley National Laboratory, [email protected]