bert-base-sudachitra-v11

This model is a variant of SudachiTra. The differences between the original chiTra v1.1 and bert-base-sudachitra-v11 are:

word_form_type was changed from normalized_nouns to surface
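
To illustrate what this setting changes, here is a minimal, self-contained sketch. It does not call SudachiPy or SudachiTra: `ToyMorpheme`, `word_form`, and the hand-written sample values are hypothetical and exist only for this example. It assumes that `surface` keeps each word exactly as written, and that `normalized_nouns` replaces words that do not conjugate with their normalized form; see the SudachiTra repository for the authoritative behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMorpheme:
    """Hypothetical stand-in for one analyzed word (not a SudachiPy class)."""
    surface: str      # the word as it appears in the input text
    normalized: str   # a normalized spelling of the same word
    conjugates: bool  # True for conjugating words such as verbs


def word_form(m: ToyMorpheme, word_form_type: str) -> str:
    """Pick the string that would be handed to the subword tokenizer."""
    if word_form_type == "surface":
        # Keep the text exactly as written.
        return m.surface
    if word_form_type == "normalized_nouns":
        # Assumption: only non-conjugating words are normalized.
        return m.surface if m.conjugates else m.normalized
    raise ValueError(f"unsupported word_form_type: {word_form_type}")


# Hand-written illustrative values, not the output of a real analyzer.
TOY_SENTENCE = [
    ToyMorpheme("附属", "付属", conjugates=False),
    ToyMorpheme("の", "の", conjugates=False),
    ToyMorpheme("シュミレーション", "シミュレーション", conjugates=False),
]

surface_tokens = [word_form(m, "surface") for m in TOY_SENTENCE]
normalized_tokens = [word_form(m, "normalized_nouns") for m in TOY_SENTENCE]

print(surface_tokens)
print(normalized_tokens)
```

Under these assumptions, a model configured with `surface` sees spelling variants as distinct inputs, while `normalized_nouns` would merge them before subword splitting.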

(See WorksApplications/SudachiTra on GitHub for the latest README.)

Sudachi Transformers (chiTra)

chiTra provides pre-trained language models and a Japanese tokenizer for Transformers.

chiTra pretrained language model

We used the NINJAL Web Japanese Corpus (NWJC) from the National Institute for Japanese Language and Linguistics, which contains text from around 100 million web pages.

NWJC was used after cleaning to remove unnecessary sentences.

The BERT model was pre-trained using a pre-training script implemented by NVIDIA.

License

"chiTra" is distributed by the National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. under the Apache License, Version 2.0.