bert-base-sudachitra-v11
This model is a variant of SudachiTra.
The differences between the original chiTra v1.1 and bert-base-sudachitra-v11 are:
word_form_type was changed from normalized_nouns to surface
(See GitHub - WorksApplications/SudachiTra for the latest README)
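The word_form_type setting controls which form of each token the Sudachi tokenizer emits: surface keeps tokens as written in the text, while normalized forms map variants to a canonical spelling. The sketch below illustrates that distinction using sudachipy's surface() and normalized_form() accessors directly; it assumes sudachipy and a Sudachi dictionary are installed (pip install sudachipy sudachidict_core) and is not the chiTra preprocessing pipeline itself.

```python
# Illustrative sketch: surface vs. normalized token forms in Sudachi.
# Assumption: sudachipy + sudachidict_core are installed. This shows the
# general idea behind word_form_type, not chiTra's exact configuration.

def compare_forms(text: str):
    """Return (surface, normalized) pairs for each morpheme in `text`."""
    # Import kept local so the sketch is readable without the dependency.
    from sudachipy import dictionary, tokenizer

    tok = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    return [(m.surface(), m.normalized_form()) for m in tok.tokenize(text, mode)]

if __name__ == "__main__":
    # With word_form_type=surface the model sees tokens exactly as written,
    # rather than their dictionary-normalized forms.
    for surface, normalized in compare_forms("引越しをした"):
        print(surface, normalized)
```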
Sudachi Transformers (chiTra)
chiTra provides pre-trained language models and a Japanese tokenizer for Transformers.
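A minimal usage sketch, assuming the SudachiTra package is installed (pip install sudachitra) and that this model is available on the Hugging Face Hub; the class path follows the SudachiTra README, and the model name below is hypothetical, so substitute the actual repository path.

```python
# Sketch of loading a chiTra model with the SudachiTra tokenizer.
# Assumptions: `sudachitra` and `transformers` are installed, and the Hub
# path below (hypothetical) points at this model's actual repository.

MODEL_NAME = "WorksApplications/bert-base-sudachitra-v11"  # hypothetical path

def encode(text: str):
    """Tokenize `text` with the chiTra tokenizer and run it through BERT."""
    # Imports kept local so the sketch reads without the deps installed.
    from transformers import BertModel
    from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer

    tokenizer = BertSudachipyTokenizer.from_pretrained(MODEL_NAME)
    model = BertModel.from_pretrained(MODEL_NAME)
    inputs = tokenizer(text, return_tensors="pt")
    return model(**inputs).last_hidden_state

if __name__ == "__main__":
    # Requires network access to download the model and tokenizer files.
    hidden = encode("すだちは徳島県の特産品です")
    print(hidden.shape)
```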
chiTra pretrained language model
We used the NINJAL Web Japanese Corpus (NWJC) from the National Institute for Japanese Language and Linguistics, which contains text from around 100 million web pages.
NWJC was used after cleaning to remove unnecessary sentences.
This model was trained with the BERT pre-training scripts implemented by NVIDIA.
License
Copyright (c) 2022 National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. All rights reserved.
"chiTra" is distributed by the National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. under the Apache License, Version 2.0.