---
language: ja
thumbnail: https://github.com/rinnakk/japanese-gpt2/blob/master/rinna.png
tags:
- ja
- japanese
- roberta
- masked-lm
- nlp
license: mit
datasets:
- cc100
- wikipedia
widget:
- text: "[CLS]4年に1度[MASK]は開かれる。"
mask_token: "[MASK]"
---

# japanese-roberta-base

![rinna-icon](./rinna.png)

This repository provides a base-sized Japanese RoBERTa model. The model was trained by [rinna Co., Ltd.](https://corp.rinna.co.jp/) using code from the GitHub repository [rinnakk/japanese-pretrained-models](https://github.com/rinnakk/japanese-pretrained-models).

# How to load the model

*NOTE:* Use `T5Tokenizer` to instantiate the tokenizer.

~~~~
from transformers import T5Tokenizer, RobertaForMaskedLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # set explicitly to work around a bug in tokenizer config loading

model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
~~~~

# How to use the model for masked token prediction

## Note 1: Use `[CLS]`

To predict a masked token, add a `[CLS]` token at the start of the sentence so that the model encodes it correctly. The model saw `[CLS]` in this position during training.

## Note 2: Use `[MASK]` after tokenization

(A) Typing `[MASK]` directly into the input string and (B) replacing a token with `[MASK]` after tokenization yield different token sequences, and therefore different predictions. Option (B) is more appropriate because it is consistent with how the model was pretrained. However, the Hugging Face Inference API only supports typing `[MASK]` into the input string, so its predictions are less robust.

## Example

The following example illustrates how the model works as a masked language model. Note that the results of running this code differ from those of the Hugging Face Inference API.

~~~~
# original text
text = "4年に1度オリンピックは開かれる。"

# prepend [CLS]
text = "[CLS]" + text

# tokenize
tokens = tokenizer.tokenize(text)
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。']

# mask a token (index 5 is 'オリンピック')
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']

# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]

# convert to tensor
import torch
token_tensor = torch.tensor([token_ids])

# get the top 10 predictions of the masked token
model = model.eval()
with torch.no_grad():
    outputs = model(token_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)

"""
0 ワールドカップ
1 フェスティバル
2 オリンピック
3 サミット
4 東京オリンピック
5 総会
6 全国大会
7 イベント
8 世界選手権
9 パーティー
"""
~~~~

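The preparation steps in the example can be wrapped in a small helper. This is a minimal sketch, not part of the rinna code base: the name `build_masked_input` and its parameters are hypothetical, and it assumes only that the tokenizer exposes `tokenize`, `convert_tokens_to_ids`, and `mask_token`, as the `T5Tokenizer` in the example does.

```python
def build_masked_input(tokenizer, text, masked_idx, cls_token="[CLS]"):
    """Prepend [CLS], tokenize, then mask one token (hypothetical helper).

    Masking is applied after tokenization, following Note 2.
    Returns the masked token list and the matching token ids.
    """
    tokens = tokenizer.tokenize(cls_token + text)
    if not 0 < masked_idx < len(tokens):
        # Index 0 is expected to hold [CLS], which should stay intact.
        raise IndexError(f"masked_idx must be in 1..{len(tokens) - 1}")
    tokens[masked_idx] = tokenizer.mask_token
    return tokens, tokenizer.convert_tokens_to_ids(tokens)
```

With the tokenizer loaded as described in this card, `build_masked_input(tokenizer, "4年に1度オリンピックは開かれる。", 5)` should reproduce the masked token list and ids printed in the example.
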
# Model architecture
A 12-layer, 768-hidden-size transformer-based masked language model.

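As a rough illustration of what these two numbers imply, the sketch below counts the weights in the encoder stack alone. It assumes a standard BERT-style layer with a feed-forward size of 4x the hidden size, which this card does not state. It leaves out the embeddings and the language-modelling head because the vocabulary size is not given here. The function name is hypothetical.

```python
def encoder_param_count(num_layers=12, hidden=768, ffn_multiplier=4):
    """Count parameters in a BERT-style encoder stack (illustrative only).

    ffn_multiplier=4 is an assumption; the model card does not state it.
    """
    ffn = ffn_multiplier * hidden
    # Query, key, value and output projections, each with a bias vector.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers (hidden -> ffn -> hidden), each with a bias vector.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden)
    return num_layers * (attention + feed_forward + layer_norms)


print(encoder_param_count())  # 85054464
```

Under these assumptions the encoder stack holds roughly 85 million parameters.
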
# Training
The model was trained on [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz) and [Japanese Wikipedia](https://dumps.wikimedia.org/jawiki/) to optimize a masked language modelling objective on 8 V100 GPUs for around 15 days. It reaches a perplexity of about 3.9 on a dev set sampled from CC-100.

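For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean negative log-likelihood per predicted token. The sketch below shows that relationship. It assumes losses measured in nats and is not the evaluation code used for this model; the function name and sample values are illustrative.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood) for per-token losses in nats."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A perplexity of 3.9 corresponds to a mean loss of ln(3.9) nats per token.
mean_loss = math.log(3.9)
print(round(mean_loss, 2))                    # 1.36
print(round(perplexity([mean_loss] * 4), 1))  # 3.9
```
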
# Tokenization
The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer. The vocabulary was trained on Japanese Wikipedia using the official sentencepiece training script.

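sentencepiece marks a piece that begins a new word with the meta symbol `▁` (U+2581), which is why `▁4` appears in the token list in the example. The sketch below reverses that convention in plain Python to recover the text from a list of pieces. In practice the tokenizer's own decoding methods would do this; the function name here is hypothetical.

```python
def pieces_to_text(pieces):
    """Join sentencepiece-style pieces, turning the '▁' marker back into spaces."""
    return "".join(pieces).replace("▁", " ").strip()


pieces = ["▁4", "年に", "1", "度", "オリンピック", "は", "開かれる", "。"]
print(pieces_to_text(pieces))  # 4年に1度オリンピックは開かれる。
```
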
# License
[The MIT license](https://opensource.org/licenses/MIT)