BertTokenizer-based tokenizer that segments Chinese/Cantonese sentences into phrases

In addition to the 51,271 tokens inherited from the base tokenizer, 194,020 Chinese vocabulary entries (words and phrases) have been added to this tokenizer.

Usage:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('raptorkwok/wordseg-tokenizer')
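
As a quick sanity check after loading, the total vocabulary size can be inspected with len(tokenizer), which counts the base vocabulary plus added tokens. Assuming the counts above, it should come to 51,271 + 194,020 = 245,291:

# Base vocabulary + added tokens; given the counts above this should be 245291.
print(len(tokenizer))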

Examples:

Cantonese Example 1

tokenizer.tokenize("我哋今日去睇陳奕迅演唱會")
# Output: ['我哋', '今日', '去', '睇', '陳奕迅', '演唱會']

Cantonese Example 2

tokenizer.tokenize("再嘈我打爆你個嘴!")
# Output: ['再', '嘈', '我', '打爆', '你', '個', '嘴', '!']

Chinese Example 1

tokenizer.tokenize("你很肥胖呢,要開始減肥了。")
# Output: ['你', '很', '肥胖', '呢', ',', '要', '開始', '減肥', '了', '。']

Chinese Example 2

tokenizer.tokenize("案件現由大嶼山警區重案組接手調查。")
# Output: ['案件', '現', '由', '大嶼山', '警區', '重案組', '接手', '調查', '。']
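
Beyond tokenize(), the standard transformers encode/decode API also works, so the segmented tokens map directly to input IDs for a BERT-style model. A minimal sketch, reusing Cantonese Example 1 (the [CLS]/[SEP] special tokens are added by the tokenizer by default):

# Encode a sentence to input IDs, then map the IDs back to tokens.
encoded = tokenizer("我哋今日去睇陳奕迅演唱會")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(tokens)
# Expected: ['[CLS]', '我哋', '今日', '去', '睇', '陳奕迅', '演唱會', '[SEP]']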

Questions?

Please feel free to leave a message in the Community tab.
