
bert-base-sudachitra-v11

This model is a variant of SudachiTra. The differences between the original chiTra v1.1 and bert-base-sudachitra-v11 are:

  • word_form_type was changed from normalized_nouns to surface
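As a rough illustration of what this setting affects, the sketch below picks a token form per morpheme. It is not the SudachiTra implementation: the `Morpheme` class, the `word_form` helper, and the two hand-written records are hypothetical, and it assumes that `normalized_nouns` means "normalized form for nouns, surface form for everything else". The SudachiTra README has the exact definition.

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    """Hypothetical stand-in for one morpheme produced by a tokenizer."""
    surface: str     # text exactly as written in the input
    normalized: str  # normalized spelling of the same word
    is_noun: bool    # simplified part-of-speech flag


def word_form(m: Morpheme, word_form_type: str) -> str:
    """Return the string used as the token for morpheme `m`."""
    if word_form_type == "surface":
        # Always keep the text as it appears in the input.
        return m.surface
    if word_form_type == "normalized_nouns":
        # Assumed behavior: normalize nouns only, keep other words as written.
        return m.normalized if m.is_noun else m.surface
    raise ValueError(f"unsupported word_form_type: {word_form_type}")


# Hand-written example records (not actual tokenizer output).
morphemes = [
    Morpheme(surface="附属", normalized="付属", is_noun=True),
    Morpheme(surface="食べ", normalized="食べる", is_noun=False),
]

surface_tokens = [word_form(m, "surface") for m in morphemes]
normalized_noun_tokens = [word_form(m, "normalized_nouns") for m in morphemes]
```

Under these assumptions, `surface` leaves every token as written, so a model trained with it sees spelling variants such as 附属 and 付属 as distinct tokens.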

(See GitHub - WorksApplications/SudachiTra for the latest README)

Sudachi Transformers (chiTra)

chiTra provides the pre-trained language models and a Japanese tokenizer for Transformers.

chiTra pretrained language model

We used the NINJAL Web Japanese Corpus (NWJC) from the National Institute for Japanese Language and Linguistics, which contains text from around 100 million web pages.

NWJC was cleaned before use to remove unnecessary sentences.

The BERT model was pre-trained using a pre-training script implemented by NVIDIA.

License

Copyright (c) 2022 National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. All rights reserved.

"chiTra" is distributed by the National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. under the Apache License, Version 2.0.