---
language:
  - zh
thumbnail: https://ckip.iis.sinica.edu.tw/files/ckip_logo.png
tags:
  - pytorch
  - token-classification
  - bert
  - zh
license: gpl-3.0
---

# CKIP BERT Base Han Chinese WS

This model performs word segmentation on ancient Chinese text. Its training data covers four eras of the Chinese language: Old, Middle, Early Modern, and Modern Chinese.

## Homepage
* [ckiplab/han-transformers](https://github.com/ckiplab/han-transformers)

## Training Datasets
The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.
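* [中央研究院上古漢語標記語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/akiwi/kiwi.sh) (Academia Sinica Tagged Corpus of Old Chinese)
* [中央研究院中古漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/dkiwi/kiwi.sh) (Academia Sinica Corpus of Middle Chinese)
* [中央研究院近代漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/pkiwi/kiwi.sh) (Academia Sinica Corpus of Early Modern Chinese)
* [中央研究院現代漢語語料庫](http://asbc.iis.sinica.edu.tw) (Academia Sinica Corpus of Modern Chinese)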

## Contributors
* Chin-Tung Lin at [CKIP](https://ckip.iis.sinica.edu.tw/)

## Usage

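* Using our model in your script
    ```python
    from transformers import (
      AutoTokenizer,
      AutoModelForTokenClassification,
    )

    tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese-ws")
    # AutoModelForTokenClassification keeps the B/I tagging head;
    # plain AutoModel would load only the encoder without it.
    model = AutoModelForTokenClassification.from_pretrained("ckiplab/bert-base-han-chinese-ws")
    ```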

* Using our model for inference
    ```python
    >>> from transformers import pipeline
    >>> classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
    >>> classifier("帝堯曰放勳")

    # output
    [{'entity': 'B',
    'score': 0.9999793,
    'index': 1,
    'word': '帝',
    'start': 0,
    'end': 1},
    {'entity': 'I',
    'score': 0.9915047,
    'index': 2,
    'word': '堯',
    'start': 1,
    'end': 2},
    {'entity': 'B',
    'score': 0.99992275,
    'index': 3,
    'word': '曰',
    'start': 2,
    'end': 3},
    {'entity': 'B',
    'score': 0.99905187,
    'index': 4,
    'word': '放',
    'start': 3,
    'end': 4},
    {'entity': 'I',
    'score': 0.96299917,
    'index': 5,
    'word': '勳',
    'start': 4,
    'end': 5}]
    ```
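
The pipeline returns one `B` (word-initial) or `I` (word-internal) tag per character rather than the segmented words themselves. The following is a minimal sketch of how those tags could be grouped into words. The helper name `tags_to_words` is hypothetical and not part of this model or of `transformers`, and the sketch assumes the labels are exactly `B` and `I` as in the sample output above. The `predictions` list is copied from that output, with the `score` and `index` fields omitted for brevity.

```python
def tags_to_words(text, predictions):
    """Group characters into words using per-character B/I tags.

    `predictions` is a list of dicts shaped like the pipeline output,
    each with at least 'entity', 'start', and 'end' keys.
    """
    words = []
    for pred in predictions:
        char = text[pred["start"]:pred["end"]]
        # 'B' opens a new word; 'I' extends the current one.
        # A leading 'I' (no word open yet) is treated as a new word.
        if pred["entity"] == "B" or not words:
            words.append(char)
        else:
            words[-1] += char
    return words


text = "帝堯曰放勳"
predictions = [
    {"entity": "B", "word": "帝", "start": 0, "end": 1},
    {"entity": "I", "word": "堯", "start": 1, "end": 2},
    {"entity": "B", "word": "曰", "start": 2, "end": 3},
    {"entity": "B", "word": "放", "start": 3, "end": 4},
    {"entity": "I", "word": "勳", "start": 4, "end": 5},
]

print(tags_to_words(text, predictions))
# ['帝堯', '曰', '放勳']
```

With a live pipeline, the same helper could be called as `tags_to_words(text, classifier(text))`, assuming the pipeline emits one prediction per character as it does in the example above.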