roberta-long-japanese (jumanpp + sentencepiece, mC4 Japanese)
This is a longer-input version of the RoBERTa Japanese model, pretrained on approximately 200M Japanese sentences. max_position_embeddings has been increased to 1282, allowing it to handle much longer inputs than the basic RoBERTa model.
The tokenization model and logic are exactly the same as in nlp-waseda/roberta-base-japanese.
The input text should first be pretokenized into words by Juman++ v2.0.0-rc3; SentencePiece tokenization is then applied to the resulting whitespace-separated token sequence.
See tokenizer_config.json for details.
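The two-stage flow described above can be sketched as follows. This is a minimal sketch, not the authors' pipeline: it assumes the `jumanpp` binary is on the PATH and that its `--segment` option prints the words separated by whitespace. `join_surfaces` and `pretokenize` are hypothetical helper names.

```python
import subprocess


def join_surfaces(surfaces):
    """Join word surfaces with single spaces, skipping empty entries."""
    return " ".join(s for s in surfaces if s.strip())


def pretokenize(text, command=("jumanpp", "--segment")):
    """Segment raw Japanese text into whitespace-separated words.

    Assumption: `command` reads text on stdin and writes the segmented
    words to stdout, separated by whitespace.
    """
    result = subprocess.run(
        list(command),
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return join_surfaces(result.stdout.split())
```

The returned string is in the whitespace-separated form the tokenizer expects, so SentencePiece can then be applied to it by the tokenizer itself.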
How to use
Please install Juman++ v2.0.0-rc3 and SentencePiece in advance.
- https://github.com/ku-nlp/jumanpp#building-from-a-package
- https://github.com/google/sentencepiece#python-module
You can load the model and the tokenizer via AutoModel and AutoTokenizer, respectively.
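from transformers import AutoModel, AutoTokenizer
# Load the pretrained model and its tokenizer from the Hugging Face Hub.
model = AutoModel.from_pretrained("megagonlabs/roberta-long-japanese")
tokenizer = AutoTokenizer.from_pretrained("megagonlabs/roberta-long-japanese")
# The input must already be segmented by Juman++ (words separated by spaces).
model(**tokenizer("まさに オール マイ ティー な 商品 だ 。", return_tensors="pt")).last_hidden_state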
tensor([[[ 0.1549, -0.7576, 0.1098, ..., 0.7124, 0.8062, -0.9880],
[-0.6586, -0.6138, -0.5253, ..., 0.8853, 0.4822, -0.6463],
[-0.4502, -1.4675, -0.4095, ..., 0.9053, -0.2017, -0.7756],
...,
[ 0.3505, -1.8235, -0.6019, ..., -0.0906, -0.5479, -0.6899],
[ 1.0524, -0.8609, -0.6029, ..., 0.1022, -0.6802, 0.0982],
[ 0.6519, -0.2042, -0.6205, ..., -0.0738, -0.0302, -0.1955]]],
grad_fn=<NativeLayerNormBackward0>)
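For inputs that may exceed the model's limit, truncation can be requested from the tokenizer. The following is a minimal sketch, assuming a limit of 1280 tokens. That figure is an assumption derived from max_position_embeddings = 1282 under the usual RoBERTa position offset of 2, not a value stated here. `embed_pretokenized` is a hypothetical helper that takes the model and tokenizer loaded above.

```python
def embed_pretokenized(text, model, tokenizer, max_length=1280):
    """Return last_hidden_state for one Juman++-segmented text.

    Assumption: max_length=1280 is the usable input length; verify it
    against the model's config.json and tokenizer_config.json.
    """
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,        # drop tokens beyond max_length
        max_length=max_length,
    )
    return model(**inputs).last_hidden_state
```

Calling the helper inside `torch.no_grad()` avoids building the autograd graph that shows up as `grad_fn` in the output above.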
Model architecture
The model architecture is almost the same as nlp-waseda/roberta-base-japanese, except that max_position_embeddings has been increased to 1282. It has 12 layers, 768-dimensional hidden states, and 12 attention heads.
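The effect of the larger max_position_embeddings on input length can be worked out as below. This sketch assumes the standard RoBERTa convention in huggingface/transformers, where position ids start after the padding index so that two positions are unavailable for tokens. It also assumes two special tokens are added per sequence. The helper names are hypothetical, and the numbers should be checked against config.json and tokenizer_config.json.

```python
MAX_POSITION_EMBEDDINGS = 1282  # value given in this model card
POSITION_OFFSET = 2             # assumption: standard RoBERTa offset (padding_idx + 1)


def usable_sequence_length(max_position_embeddings=MAX_POSITION_EMBEDDINGS,
                           offset=POSITION_OFFSET):
    """Number of token positions available, special tokens included."""
    return max_position_embeddings - offset


def chunk_token_ids(token_ids, max_length, num_special_tokens=2):
    """Split a long list of token ids into windows that fit the model.

    Each window leaves room for `num_special_tokens` (assumed to be 2:
    one start and one end token) to be added afterwards.
    """
    window = max_length - num_special_tokens
    if window <= 0:
        raise ValueError("max_length must exceed num_special_tokens")
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]


limit = usable_sequence_length()
chunks = chunk_token_ids(list(range(3000)), limit)
print(limit, [len(c) for c in chunks])
```

Under these assumptions, a document of 3000 token ids is split into three windows, each short enough to be encoded separately.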
Training data and libraries
This model was trained on Japanese text extracted from mC4, the multilingual web crawl corpus built from Common Crawl. We used Sudachi to split the text into sentences and applied a simple rule-based filter to remove nonlinguistic segments. The extracted text contains over 600M sentences in total, of which we used approximately 200M for pretraining.
We used the huggingface/transformers RoBERTa implementation for pretraining. Pretraining took about 700 hours on a GCP instance with 8 A100 GPUs, with Automatic Mixed Precision enabled.
Licenses
The pretrained models are distributed under the terms of the MIT License.
Citations
- mC4
Contains information from mC4, which is made available under the ODC Attribution License.
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}