---
tags:
- PretrainModel
- TCM
- transformer
- herberta
- text-embedding
license: apache-2.0
language:
- zh
- en
metrics:
- accuracy
base_model:
- hfl/chinese-roberta-wwm-ext-large
new_version: XiaoEnn/herberta_seq_512_V2
inference: true
library_name: transformers
---

# Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks

## Introduction

Herberta is a pretrained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built on **chinese-roberta-wwm-ext-large**, it is further pretrained with masked language modeling (MLM) on **700 ancient books (538.95M)** and **48 modern Chinese medicine textbooks (54M)**, yielding a robust model for embedding generation and TCM-specific downstream tasks. The name "Herberta" combines "Herb" and "RoBERTa" to reflect its focus on herbal medicine research.

Herberta is well suited for applications such as:

- **Encoder for herbal formulas**: generating meaningful embeddings for TCM formulations.
- **Domain-specific word embeddings**: serving the Chinese medicine text domain.
- **TCM downstream tasks**: classification, sequence labeling, and more.

---

## Pretraining Experiments

### Dataset

| Data Type                | Quantity         | Data Size  |
|--------------------------|------------------|------------|
| **Ancient TCM Books**    | 700 books        | ~538.95M   |
| **Modern TCM Textbooks** | 48 books         | ~54M       |
| **Mixed-Type Dataset**   | Combined dataset | ~637.8M    |

### Pretraining Results

| Model                   | Eval Accuracy | Valid Loss | Valid Perplexity |
|-------------------------|---------------|------------|------------------|
| **herberta_seq_512_v2** | 0.9841        | 0.04367    | 1.083            |
| **herberta_seq_128_v2** | 0.9406        | 0.2877     | 1.333            |
| **herberta_seq_512_v3** | 0.755         | 1.100      | 3.010            |

#### Metrics Comparison

![Accuracy](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/RDgI-0Ro2kMiwV853Wkgx.png)
![Loss](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/BJ7enbRg13IYAZuxwraPP.png)
![Perplexity](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/lOohRMIctPJZKM5yEEcQ2.png)

### Pretraining Configuration

#### Ancient Books

- Pretraining strategy: BERT-style masking (15% of tokens masked)
- Sequence length: 512
- Batch size: 32
- Learning rate: `1e-5` with an epoch-based decay (`epoch * 0.1`)
- Tokenization: sentence-based segmentation; sequences shorter than 512 tokens are padded (a minimal reproduction sketch follows this list)
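For readers who want to set up a comparable MLM run, the sketch below wires these settings into the Hugging Face `Trainer`. It is a minimal illustration under stated assumptions, not the authors' original training script: the corpus file `tcm_corpus.txt` (one sentence per line), the epoch count, and the output directory are placeholders, and the card's epoch-based learning-rate decay is replaced by the `Trainer` default schedule.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder corpus: the TCM texts split into one sentence per line.
model_name = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

raw = load_dataset("text", data_files={"train": "tcm_corpus.txt"})

def tokenize(batch):
    # Sentence-based examples, padded/truncated to 512 tokens as described above.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# BERT-style dynamic masking: 15% of tokens are selected as MLM targets.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="herberta_mlm",          # placeholder output directory
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    num_train_epochs=10,                # placeholder; the card does not state the epoch count
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized["train"],
)

trainer.train()
```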
---

## Downstream Task: TCM Pattern Classification

### Task Definition

Using **321 pattern descriptions** extracted from TCM internal medicine textbooks, we evaluated classification performance for four models:

1. **Herberta_seq_512_v2**: pretrained on 700 ancient TCM books.
2. **Herberta_seq_512_v3**: pretrained on 48 modern TCM textbooks.
3. **Herberta_seq_128_v2**: pretrained on 700 ancient TCM books (128-token sequences).
4. **Roberta**: baseline model without TCM-specific pretraining.

### Training Configuration

- Max sequence length: 512
- Batch size: 16
- Epochs: 30

### Results

| Model Name              | Eval Accuracy | Eval F1    | Eval Precision | Eval Recall |
|-------------------------|---------------|------------|----------------|-------------|
| **Herberta_seq_512_v2** | **0.9454**    | **0.9293** | **0.9221**     | **0.9454**  |
| **Herberta_seq_512_v3** | 0.8989        | 0.8704     | 0.8583         | 0.8989      |
| **Herberta_seq_128_v2** | 0.8716        | 0.8443     | 0.8351         | 0.8716      |
| **Roberta**             | 0.8743        | 0.8425     | 0.8311         | 0.8743      |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/1yG96YdzXuxQlTfjOmXqg.png)

#### Summary

**Herberta_seq_512_v2**, pretrained on 700 ancient TCM books, performed best across all evaluation metrics. This highlights the value of domain-specific pretraining on larger and historically richer corpora for TCM applications.

---

## Quickstart

### Use Hugging Face

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text ("TCM theory is a treasure of our traditional culture.")
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```

A short example of comparing two texts with these embeddings is given at the end of this card.

If you find our work helpful, please cite us:

```bibtex
@misc{herberta-embedding,
  title  = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  url    = {https://github.com/15392778677/herberta},
  author = {Yehan Yang and Xinhan Zheng},
  month  = {December},
  year   = {2024}
}

@article{herberta-technical-report,
  title       = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angelpro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
```
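### Example: Sentence Similarity with Herberta Embeddings

As a follow-up to the quickstart, the sketch below embeds two short TCM texts and compares them with cosine similarity. It is an illustrative example only: the two input sentences are placeholders chosen for this card, and the attention-mask-aware mean pooling mirrors the pooling used in the quickstart rather than a pooling scheme prescribed by the model.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Placeholder inputs: a formula description and a pattern description.
texts = [
    "麻黄汤：发汗解表，宣肺平喘。",  # Mahuang Decoction: induces sweating, releases the exterior
    "风寒束表，肺气失宣。",          # Wind-cold fettering the exterior, lung qi failing to diffuse
]

inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over tokens, ignoring padding positions via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2]).item()
print(f"Cosine similarity: {similarity:.4f}")
```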