Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks
Introduction
Herberta is a pre-trained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built on the chinese-roberta-wwm-ext-large model, Herberta is pre-trained with a masked language modeling (MLM) objective on a corpus of 700 ancient books (~538.95M) and 48 modern Chinese medicine textbooks (~54M), yielding a robust model for embedding generation and TCM-specific downstream tasks.
We named the model "Herberta" by combining "Herb" and "Roberta" to signify its purpose in herbal medicine research. Herberta is ideal for applications such as:
- Encoder for Herbal Formulas: Generating meaningful embeddings for TCM formulations.
- Domain-Specific Word Embedding: Serving the Chinese medicine text domain.
- Support for TCM Downstream Tasks: Including classification, labeling, and more.
Pretraining Experiments
Dataset
Data Type | Quantity | Data Size |
---|---|---|
Ancient TCM Books | 700 books | ~538.95M |
Modern TCM Textbooks | 48 books | ~54M |
Mixed-Type Dataset | Combined dataset | ~637.8M |
Pretraining Results
Model | Eval Accuracy | Validation Loss | Validation Perplexity |
---|---|---|---|
herberta_seq_512_v2 | 0.9841 | 0.04367 | 1.083 |
herberta_seq_128_v2 | 0.9406 | 0.2877 | 1.333 |
herberta_seq_512_v3 | 0.755 | 1.100 | 3.010 |
Metrics Comparison
Pretraining Configuration
Ancient Books
- Pretraining Strategy: BERT-style MASK (15% tokens masked)
- Sequence Length: 512
- Batch Size: 32
- Learning Rate: 1e-5 with an epoch-based decay (epoch * 0.1)
- Tokenization: Sentence-based tokenization with padding for sequences shorter than 512 tokens.
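The configuration above maps closely onto the Hugging Face Trainer API. Below is a minimal, hedged sketch of such an MLM run; the corpus file, output directory, and epoch count are placeholder assumptions, and the custom epoch-based learning-rate decay is omitted in favor of the default scheduler.

```python
# Hedged sketch: MLM pretraining with 15% masking, sequence length 512,
# batch size 32, learning rate 1e-5. The corpus file, output directory,
# and epoch count below are placeholders, not the team's actual settings.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Placeholder corpus: one TCM sentence per line in a plain-text file.
raw = load_dataset("text", data_files={"train": "tcm_corpus.txt"})

def tokenize(batch):
    # Sentence-based tokenization, padded/truncated to 512 tokens.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# BERT-style dynamic masking: 15% of tokens are selected for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="herberta_mlm",
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    num_train_epochs=10,  # assumed value; the epoch count is not stated above
    save_strategy="epoch",
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```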
Downstream Task: TCM Pattern Classification
Task Definition
Using 321 pattern descriptions extracted from TCM internal medicine textbooks, we evaluated classification performance across four models:
- Herberta_seq_512_v2: Pretrained on 700 ancient TCM books.
- Herberta_seq_512_v3: Pretrained on 48 modern TCM textbooks.
- Herberta_seq_128_v2: Pretrained on 700 ancient TCM books (128-length sequences).
- Roberta: Baseline model without TCM-specific pretraining.
Training Configuration
- Max Sequence Length: 512
- Batch Size: 16
- Epochs: 30
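A hedged sketch of this fine-tuning setup with the Hugging Face Trainer is shown below; the dataset files, column names, and label count are illustrative placeholders, not the actual experiment artifacts.

```python
# Hedged sketch: fine-tuning Herberta for TCM pattern classification
# (max_length=512, batch size 16, 30 epochs). Dataset files and label count are placeholders.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_name = "XiaoEnn/herberta"
num_labels = 48  # placeholder: set to the actual number of pattern classes

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# Placeholder CSV files with columns "text" (pattern description) and "label" (class id).
data = load_dataset("csv", data_files={"train": "patterns_train.csv", "validation": "patterns_val.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="herberta_pattern_cls",
    per_device_train_batch_size=16,
    num_train_epochs=30,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()

# Without a compute_metrics function, evaluate() reports eval_loss only;
# accuracy/F1/precision/recall require supplying one.
print(trainer.evaluate())
```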
Results
Model Name | Eval Accuracy | Eval F1 | Eval Precision | Eval Recall |
---|---|---|---|---|
Herberta_seq_512_v2 | 0.9454 | 0.9293 | 0.9221 | 0.9454 |
Herberta_seq_512_v3 | 0.8989 | 0.8704 | 0.8583 | 0.8989 |
Herberta_seq_128_v2 | 0.8716 | 0.8443 | 0.8351 | 0.8716 |
Roberta | 0.8743 | 0.8425 | 0.8311 | 0.8743 |
Summary
The Herberta_seq_512_v2 model, pretrained on 700 ancient TCM books, achieved the best scores on all four evaluation metrics, outperforming both the modern-textbook variant and the Roberta baseline. This highlights the value of domain-specific pretraining on a larger and historically richer corpus for TCM applications.
Quickstart
Use with Hugging Face Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```
If you find our work helpful, please consider citing:
```bibtex
@misc{herberta-embedding,
  title  = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  author = {Yehan Yang and Xinhan Zheng},
  url    = {https://github.com/15392778677/herberta},
  month  = {December},
  year   = {2024}
}

@article{herberta-technical-report,
  title       = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angelpro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
```
Base model: hfl/chinese-roberta-wwm-ext-large