BertTokenizer-based tokenizer that segments Chinese/Cantonese sentences into phrases

In addition to the 51,271 tokens inherited from the base tokenizer, 194,020 Chinese vocabulary entries (words and phrases) have been added to this tokenizer.

Usage:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('raptorkwok/wordseg-tokenizer')
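
As a quick sanity check after loading, the total vocabulary size can be inspected with len(tokenizer), which counts the base vocabulary plus added tokens. Assuming the counts above, it should come to 51,271 + 194,020 = 245,291:

# Base vocabulary + added tokens; given the counts above this should be 245291.
print(len(tokenizer))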

Examples:

Cantonese Example 1

tokenizer.tokenize("我哋今日去睇陳奕迅演唱會")
# Output: ['我哋', '今日', '去', '睇', '陳奕迅', '演唱會']

Cantonese Example 2

tokenizer.tokenize("再嘈我打爆你個嘴!")
# Output: ['再', '嘈', '我', '打爆', '你', '個', '嘴', '!']

Chinese Example 1

tokenizer.tokenize("你很肥胖呢,要開始減肥了。")
# Output: ['你', '很', '肥胖', '呢', ',', '要', '開始', '減肥', '了', '。']

Chinese Example 2

tokenizer.tokenize("案件現由大嶼山警區重案組接手調查。")
# Output: ['案件', '現', '由', '大嶼山', '警區', '重案組', '接手', '調查', '。']
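
Beyond tokenize(), the standard transformers encode/decode API also works, so the segmented tokens map directly to input IDs for a BERT-style model. A minimal sketch, reusing Cantonese Example 1 (the [CLS]/[SEP] special tokens are added by the tokenizer by default):

# Encode a sentence to input IDs, then map the IDs back to tokens.
encoded = tokenizer("我哋今日去睇陳奕迅演唱會")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(tokens)
# Expected: ['[CLS]', '我哋', '今日', '去', '睇', '陳奕迅', '演唱會', '[SEP]']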

Questions?

Please feel free to leave a message in the Community tab.
