---
language:
- zh
thumbnail: https://ckip.iis.sinica.edu.tw/files/ckip_logo.png
tags:
- pytorch
- token-classification
- bert
- zh
license: gpl-3.0
---

# CKIP BERT Base Han Chinese WS

This model performs word segmentation (WS) for ancient Chinese. The training data covers four eras of the Chinese language, corresponding to the four corpora listed below: Old, Middle, Early Modern, and Modern Chinese.

## Homepage

* [ckiplab/han-transformers](https://github.com/ckiplab/han-transformers)

## Training Datasets

The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.

* [中央研究院上古漢語標記語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/akiwi/kiwi.sh) (Academia Sinica tagged corpus of Old Chinese)
* [中央研究院中古漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/dkiwi/kiwi.sh) (Academia Sinica corpus of Middle Chinese)
* [中央研究院近代漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/pkiwi/kiwi.sh) (Academia Sinica corpus of Early Modern Chinese)
* [中央研究院現代漢語語料庫](http://asbc.iis.sinica.edu.tw) (Academia Sinica corpus of Modern Chinese)

## Contributors

* Chin-Tung Lin at [CKIP](https://ckip.iis.sinica.edu.tw/)

## Usage

* Using our model in your script

    ```python
    from transformers import (
        AutoTokenizer,
        AutoModel,
    )

    tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese-ws")
    # Note: AutoModel loads the BERT encoder only; use
    # AutoModelForTokenClassification if you need the segmentation head.
    model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese-ws")
    ```

* Using our model for inference

    ```python
    >>> from transformers import pipeline
    >>> classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
    >>> classifier("帝堯曰放勳")

    # output
    [{'entity': 'B', 'score': 0.9999793, 'index': 1, 'word': '帝', 'start': 0, 'end': 1},
     {'entity': 'I', 'score': 0.9915047, 'index': 2, 'word': '堯', 'start': 1, 'end': 2},
     {'entity': 'B', 'score': 0.99992275, 'index': 3, 'word': '曰', 'start': 2, 'end': 3},
     {'entity': 'B', 'score': 0.99905187, 'index': 4, 'word': '放', 'start': 3, 'end': 4},
     {'entity': 'I', 'score': 0.96299917, 'index': 5, 'word': '勳', 'start': 4, 'end': 5}]
    ```
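The pipeline returns one label per character rather than a list of words. A minimal sketch of turning those labels into segmented words is shown below. It assumes that `B` marks the first character of a word and `I` marks a continuation, which is how the sample output above reads. The helper name `bi_tags_to_words` is hypothetical and not part of the CKIP or `transformers` API, and the `sample` list is a trimmed copy of the output shown above, so the snippet runs without downloading the model.

```python
def bi_tags_to_words(predictions):
    """Group per-character B/I predictions into words.

    Assumption: 'B' starts a new word and 'I' continues the previous one.
    """
    words = []
    for item in predictions:
        # Start a new word on 'B', or if an 'I' appears with nothing before it.
        if item["entity"] == "B" or not words:
            words.append(item["word"])
        else:
            words[-1] += item["word"]
    return words


# Trimmed copy of the pipeline output for "帝堯曰放勳" (scores and offsets omitted).
sample = [
    {"entity": "B", "word": "帝"},
    {"entity": "I", "word": "堯"},
    {"entity": "B", "word": "曰"},
    {"entity": "B", "word": "放"},
    {"entity": "I", "word": "勳"},
]

print(bi_tags_to_words(sample))  # ['帝堯', '曰', '放勳']
```

With the real model, the same helper could be applied directly to the list returned by `classifier(...)`, since each entry carries the `entity` and `word` keys used here.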