Model description
- This model was trained on Chinese (ZH), Japanese (JA), and Korean (KO) Wikipedia for 5 epochs.
How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
```
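The loaded model can be used directly for masked-character prediction. The snippet below is an illustrative sketch (the example sentence and output handling are not from the original card); it assumes the standard BERT-style `[MASK]` token.

```python
from transformers import pipeline

# Build a fill-mask pipeline from the model and tokenizer loaded above.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Since tokenization is character-based, [MASK] stands for a single character.
predictions = fill_mask("東京は日本の首[MASK]です。")
for p in predictions:
    print(p["token_str"], p["score"])
```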
- You do not need to segment the text before fine-tuning on downstream tasks.
- (You may, however, obtain better results if you apply morphological analysis to the data before fine-tuning; see the sketch after the tool list below.)
Morphological analysis tools
- ZH: For Chinese, we use LTP.
- JA: For Japanese, we use Juman++.
- KO: For Korean, we use KoNLPy (Kkma class).
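As a hedged example of such pre-segmentation (a sketch only; the helper name and input sentence are illustrative, and the original card does not prescribe this exact workflow), Korean text could be split into space-separated morphemes with KoNLPy's Kkma class before fine-tuning:

```python
from konlpy.tag import Kkma

kkma = Kkma()

def presegment(text: str) -> str:
    # Split the sentence into morphemes and rejoin them with spaces,
    # so that word boundaries are explicit in the fine-tuning data.
    return " ".join(kkma.morphs(text))

print(presegment("한국어 문장을 형태소 단위로 나눕니다."))
```

An analogous step would use LTP for Chinese and Juman++ for Japanese.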
Tokenization
- We use character-based tokenization with a whole-word-masking strategy.
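For instance (an illustrative check, not from the original card; the exact tokens depend on the released vocabulary), the tokenizer loaded above should split CJK text into individual characters:

```python
# Each CJK character is expected to become its own token,
# e.g. roughly ['日', '本', '語']; exact output depends on the vocabulary.
print(tokenizer.tokenize("日本語"))
```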
Model size
- vocab_size: 15015
- num_hidden_layers: 4
- hidden_size: 512
- num_attention_heads: 8
- param_num: 25M
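These numbers can be sanity-checked against the loaded checkpoint with standard transformers/PyTorch attributes (a quick sketch, not part of the original card):

```python
# Inspect the configuration and count parameters.
print(model.config.vocab_size, model.config.num_hidden_layers,
      model.config.hidden_size, model.config.num_attention_heads)
print(sum(p.numel() for p in model.parameters()))  # expected to be on the order of 25M
```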