language: ja
license: apache-2.0
tags:
- SudachiTra
- Sudachi
- SudachiPy
- bert
- Japanese
- NWJC
datasets:
- NWJC
bert-base-sudachitra-v11
This model is a variant of SudachiTra.
The differences between the original chiTra v1.1 and bert-base-sudachitra-v11 are:
- word_form_type was changed from normalized_nouns to surface.
- Each run of two consecutive empty lines in vocab.txt was replaced with a dummy entry followed by an empty line.
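The vocab.txt change can be illustrated with a minimal sketch. This is not the script actually used for this model: the function name `patch_vocab_lines` and the dummy token string `[DUMMY]` are hypothetical, chosen only for illustration. The sketch assumes the vocabulary has already been split into one entry per line.

```python
from typing import List


def patch_vocab_lines(lines: List[str], dummy: str = "[DUMMY]") -> List[str]:
    """Replace each pair of consecutive empty lines with a dummy entry
    followed by an empty line.

    `dummy` is a hypothetical placeholder token, not the one used in the
    released vocab.txt.
    """
    patched = []
    i = 0
    while i < len(lines):
        is_empty_pair = (
            lines[i] == "" and i + 1 < len(lines) and lines[i + 1] == ""
        )
        if is_empty_pair:
            # First empty line becomes the dummy entry, second stays empty.
            patched.append(dummy)
            patched.append("")
            i += 2
        else:
            patched.append(lines[i])
            i += 1
    return patched


example = ["[PAD]", "[UNK]", "", "", "の"]
print(patch_vocab_lines(example))
```

Because one line is substituted for one line, the number of lines is unchanged, so every other entry keeps its position in the file.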
See also the description from the original README.md below.
(For the latest README, see the WorksApplications/SudachiTra repository on GitHub.)
Sudachi Transformers (chiTra)
chiTra provides pre-trained language models and a Japanese tokenizer for Transformers.
chiTra pretrained language model
We used the NINJAL Web Japanese Corpus (NWJC) from the National Institute for Japanese Language and Linguistics, which contains text from around 100 million web pages.
The corpus was cleaned to remove unnecessary sentences before use.
The BERT model was trained with a pretraining script implemented by NVIDIA.
License
Copyright (c) 2022 National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. All rights reserved.
"chiTra" is distributed by the National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. under the Apache License, Version 2.0.
Citation
@INPROCEEDINGS{katsuta2022chitra,
author = {勝田哲弘, 林政義, 山村崇, Tolmachev Arseny, 高岡一馬, 内田佳孝, 浅原正幸},
title = {単語正規化による表記ゆれに頑健な BERT モデルの構築},
booktitle = "言語処理学会第28回年次大会(NLP2022)",
year = "2022",
pages = "",
publisher = "言語処理学会",
}