Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks

Introduction

Herberta is a pre-trained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built on chinese-roberta-wwm-ext-large, it is further pre-trained with a masked language modeling (MLM) objective on 700 ancient TCM books (~538.95M) and 48 modern Chinese medicine textbooks (~54M), yielding a robust model for embedding generation and TCM-specific downstream tasks.

We named the model "Herberta" by combining "Herb" and "Roberta" to signify its purpose in herbal medicine research. Herberta is ideal for applications such as:

  • Encoder for Herbal Formulas: Generating meaningful embeddings for TCM formulations.
  • Domain-Specific Word Embedding: Serving the Chinese medicine text domain.
  • Support for TCM Downstream Tasks: Including classification, labeling, and more.

Pretraining Experiments

Dataset

| Data Type | Quantity | Data Size |
|---|---|---|
| Ancient TCM Books | 700 books | ~538.95M |
| Modern TCM Textbooks | 48 books | ~54M |
| Mixed-Type Dataset | Combined dataset | ~637.8M |

Pretraining Results

| Model | Eval Accuracy | Validation Loss | Validation Perplexity |
|---|---|---|---|
| herberta_seq_512_v2 | 0.9841 | 0.04367 | 1.083 |
| herberta_seq_128_v2 | 0.9406 | 0.2877 | 1.333 |
| herberta_seq_512_v3 | 0.755 | 1.100 | 3.010 |
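
For reference, the reported validation perplexity is, by convention, the exponential of the mean validation cross-entropy loss:

Perplexity_valid ≈ exp(Loss_valid)

so lower values of both metrics indicate a better-fit masked language model.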

Metrics Comparison

Figure: accuracy, loss, and perplexity comparison across the pretraining runs.

Pretraining Configuration

Ancient Books

  • Pretraining Strategy: BERT-style MASK (15% tokens masked)
  • Sequence Length: 512
  • Batch Size: 32
  • Learning Rate: 1e-5 with an epoch-based decay (epoch * 0.1)
  • Tokenization: Sentence-based tokenization with padding for sequences <512 tokens.
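
The exact training script is not reproduced here; the following is a minimal sketch of an MLM run matching the masking rate, sequence length, batch size, and base learning rate above, using the Hugging Face Trainer. The corpus path and epoch count are placeholders, and the epoch-based learning-rate decay is not reproduced.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          TrainingArguments, Trainer)

# Base checkpoint named in the introduction (hfl release of chinese-roberta-wwm-ext-large)
base_model = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Placeholder corpus file: one sentence per line, pre-split at sentence boundaries
corpus = load_dataset("text", data_files="tcm_corpus.txt")["train"]

def tokenize(batch):
    # Pad/truncate to the 512-token sequence length used for pretraining
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

corpus = corpus.map(tokenize, batched=True, remove_columns=["text"])

# BERT-style masking: 15% of tokens are masked, as described above
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="herberta-mlm",
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    num_train_epochs=10,   # placeholder epoch count
)

trainer = Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator)
trainer.train()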

Downstream Task: TCM Pattern Classification

Task Definition

Using 321 pattern descriptions extracted from TCM internal medicine textbooks, we evaluated the classification performance on four models:

  1. Herberta_seq_512_v2: Pretrained on 700 ancient TCM books.
  2. Herberta_seq_512_v3: Pretrained on 48 modern TCM textbooks.
  3. Herberta_seq_128_v2: Pretrained on 700 ancient TCM books (128-length sequences).
  4. Roberta: Baseline model without TCM-specific pretraining.

Training Configuration

  • Max Sequence Length: 512
  • Batch Size: 16
  • Epochs: 30
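
A minimal sketch of how such a fine-tune could be set up with the Hugging Face Trainer, assuming the 321 pattern descriptions live in a hypothetical pattern_descriptions.csv with a "text" column and integer "label" ids; the file name, column names, and train/test split are placeholders rather than the authors' actual pipeline.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Hypothetical CSV: one pattern description per row with an integer class id
data = load_dataset("csv", data_files="pattern_descriptions.csv")["train"]
data = data.train_test_split(test_size=0.2, seed=42)
num_labels = len(set(data["train"]["label"]))

model_name = "XiaoEnn/herberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

data = data.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Simple accuracy; precision/recall/F1 can be added analogously
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="herberta-pattern-cls",
    per_device_train_batch_size=16,
    num_train_epochs=30,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())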

Results

| Model Name | Eval Accuracy | Eval F1 | Eval Precision | Eval Recall |
|---|---|---|---|---|
| Herberta_seq_512_v2 | 0.9454 | 0.9293 | 0.9221 | 0.9454 |
| Herberta_seq_512_v3 | 0.8989 | 0.8704 | 0.8583 | 0.8989 |
| Herberta_seq_128_v2 | 0.8716 | 0.8443 | 0.8351 | 0.8716 |
| Roberta | 0.8743 | 0.8425 | 0.8311 | 0.8743 |


Summary

The Herberta_seq_512_v2 model, pretrained on 700 ancient TCM books, exhibited superior performance across all evaluation metrics. This highlights the significance of domain-specific pretraining on larger and historically richer datasets for TCM applications.


Quickstart

Use with Hugging Face Transformers

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)

If you find our work helpful, please consider citing us:

@misc{herberta-embedding,
  title  = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  url    = {https://github.com/15392778677/herberta},
  author = {Yehan Yang and Xinhan Zheng},
  month  = {December},
  year   = {2024}
}

@article{herberta-technical-report,
  title       = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angelpro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
