metadata

language:
  - zh
thumbnail: https://ckip.iis.sinica.edu.tw/files/ckip_logo.png
tags:
  - pytorch
  - token-classification
  - bert
  - zh
license: gpl-3.0

CKIP Oldhan BERT Base Chinese WS

This model performs word segmentation on Chinese text from historical as well as modern periods ("oldhan"). Its training data covers four eras of the Chinese language: old, middle, early modern, and modern.

Homepage

ckiplab/han-transformers

Training Datasets

The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.

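中央研究院上古漢語標記語料庫 (Academia Sinica Tagged Corpus of Old Chinese)
中央研究院中古漢語語料庫 (Academia Sinica Corpus of Middle Chinese)
中央研究院近代漢語語料庫 (Academia Sinica Corpus of Early Modern Chinese)
中央研究院現代漢語語料庫 (Academia Sinica Corpus of Modern Chinese)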

Contributors

Chin-Tung Lin at CKIP

Usage

Using our model in your script

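from transformers import (
  AutoTokenizer,
  AutoModelForTokenClassification,
)

# Load the tokenizer and the model together with its token-classification head
# (AutoModel alone would load only the bare encoder, without the segmentation layer).
tokenizer = AutoTokenizer.from_pretrained("ckiplab/oldhan-bert-base-chinese-ws")
model = AutoModelForTokenClassification.from_pretrained("ckiplab/oldhan-bert-base-chinese-ws")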

Using our model for inference

>>> from transformers import pipeline
>>> classifier = pipeline("token-classification", model="ckiplab/oldhan-bert-base-chinese-ws")
>>> classifier("帝堯曰放勳")
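The pipeline call above returns one prediction per character rather than a list of words. The following is a minimal sketch of how such per-character tags could be merged into words. It assumes the model uses a two-label scheme in which "B" marks the first character of a word and "I" marks a continuation; check `model.config.id2label` to confirm this before relying on it. The helper name `merge_tokens` and the labels in the example are hypothetical and written by hand, not actual model output.

```python
from typing import Iterable, List, Tuple


def merge_tokens(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Merge (character, label) pairs into words.

    Assumption: label "B" starts a new word and any other label
    (e.g. "I") continues the current word.
    """
    words: List[str] = []
    for char, label in tagged:
        if label == "B" or not words:
            # Start a new word on "B", or when nothing has been started yet.
            words.append(char)
        else:
            # Otherwise append the character to the word in progress.
            words[-1] += char
    return words


# Hand-written labels for illustration only (NOT real model output).
example = [("帝", "B"), ("堯", "I"), ("曰", "B"), ("放", "B"), ("勳", "I")]
print(merge_tokens(example))

# With the pipeline, the pairs could be built from its output, e.g.:
#   results = classifier("帝堯曰放勳")
#   pairs = [(r["word"], r["entity"]) for r in results]
#   print(merge_tokens(pairs))
```

Keeping the merge step separate from the model call makes it easy to test without downloading any weights, and to adapt if the label names turn out to differ.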