	---
	language: ja
	thumbnail: https://github.com/rinnakk/japanese-gpt2/blob/master/rinna.png
	tags:
	- ja
	- japanese
	- roberta
	- masked-lm
	- nlp
	license: mit
	datasets:
	- cc100
	- wikipedia
	widget:
	- text: "[CLS]4年に1度[MASK]は開かれる。"
	mask_token: "[MASK]"
	---

	# japanese-roberta-base

	![rinna-icon](./rinna.png)

	This repository provides a base-sized Japanese RoBERTa model. The model was trained using code from the GitHub repository [rinnakk/japanese-pretrained-models](https://github.com/rinnakk/japanese-pretrained-models) by [rinna Co., Ltd.](https://corp.rinna.co.jp/).

	# How to load the model

	NOTE: Use `T5Tokenizer` to initialize the tokenizer.

	~~~~
	from transformers import T5Tokenizer, RobertaForMaskedLM

	tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
	tokenizer.do_lower_case = True  # set manually to work around a bug in tokenizer config loading

	model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
	~~~~

	# How to use the model for masked token prediction

	## Note 1: Use `[CLS]`

	To predict a masked token, be sure to prepend a `[CLS]` token to the sentence so that the model encodes it correctly, since `[CLS]` was used this way during training.

	## Note 2: Use `[MASK]` after tokenization

	A) Typing `[MASK]` directly in an input string and B) replacing a token with `[MASK]` after tokenization yield different token sequences, and thus different prediction results. Using `[MASK]` after tokenization is more appropriate because it is consistent with how the model was pretrained. However, the Hugging Face Inference API only supports typing `[MASK]` in the input string, so it produces less robust predictions.
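To make the difference between A) and B) concrete, here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration: `TOY_VOCAB`, `toy_segment_tokenize`, and `toy_tokenize` are hypothetical stand-ins, not the real `T5Tokenizer`, and the mechanism shown (splitting the input at special tokens and adding a sentencepiece-style `▁` prefix to each segment) is only one plausible way the two sequences can diverge. The actual tokenizer output may differ in detail.

```python
import re

# Hypothetical toy vocabulary (assumption); the real model uses a
# sentencepiece vocabulary trained on Japanese Wikipedia.
TOY_VOCAB = ["▁4", "▁", "年に", "1", "度", "オリンピック", "は", "開かれる", "。"]
SPECIAL_TOKENS = ["[CLS]", "[MASK]"]


def toy_segment_tokenize(segment):
    """Greedy longest-match tokenization of one plain-text segment.

    A "▁" prefix is added at the start of the segment, mimicking a
    sentencepiece-style dummy prefix (assumed behavior for this toy).
    """
    text = "▁" + segment
    tokens = []
    pos = 0
    while pos < len(text):
        candidates = (v for v in TOY_VOCAB if text.startswith(v, pos))
        match = max(candidates, key=len, default=None)
        if match is None:
            match = text[pos]  # fall back to a single character
        tokens.append(match)
        pos += len(match)
    return tokens


def toy_tokenize(text):
    """Split on special tokens, then tokenize each plain segment on its own."""
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    tokens = []
    for piece in re.split(pattern, text):
        if not piece:
            continue
        if piece in SPECIAL_TOKENS:
            tokens.append(piece)
        else:
            tokens.extend(toy_segment_tokenize(piece))
    return tokens


# A) type [MASK] directly in the input string
typed = toy_tokenize("[CLS]4年に1度[MASK]は開かれる。")

# B) tokenize the full sentence, then replace one token with [MASK]
replaced = toy_tokenize("[CLS]4年に1度オリンピックは開かれる。")
replaced[replaced.index("オリンピック")] = "[MASK]"

print(typed)
print(replaced)
print(typed == replaced)
```

In this toy, the segment after a typed `[MASK]` is tokenized as if it started a new piece of text, so it picks up an extra `▁` token that sequence B) does not contain. A model pretrained on sequences like B) would then see an input it was not trained on.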

	## Example

	Here is an example that illustrates how our model works as a masked language model. Note the difference between running the following code and using the Hugging Face Inference API.

	~~~~
	# original text
	text = "4年に1度オリンピックは開かれる。"

	# prepend [CLS]
	text = "[CLS]" + text

	# tokenize
	tokens = tokenizer.tokenize(text)
	print(tokens) # output: ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。']

	# mask a token
	masked_idx = 5  # index of 'オリンピック' in the token list above
	tokens[masked_idx] = tokenizer.mask_token
	print(tokens) # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']

	# convert to ids
	token_ids = tokenizer.convert_tokens_to_ids(tokens)
	print(token_ids) # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]

	# convert to tensor
	import torch
	token_tensor = torch.tensor([token_ids])

	# get the top 10 predictions of the masked token
	model = model.eval()
	with torch.no_grad():
	    outputs = model(token_tensor)
	    predictions = outputs[0][0, masked_idx].topk(10)

	for i, index_t in enumerate(predictions.indices):
	    index = index_t.item()
	    token = tokenizer.convert_ids_to_tokens([index])[0]
	    print(i, token)

	"""
	0 ワールドカップ
	1 フェスティバル
	2 オリンピック
	3 サミット
	4 東京オリンピック
	5 総会
	6 全国大会
	7 イベント
	8 世界選手権
	9 パーティー
	"""
	~~~~

	# Model architecture
	A 12-layer, 768-hidden-size transformer-based masked language model.

	# Training
	The model was trained on [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz) and [Japanese Wikipedia](https://dumps.wikimedia.org/jawiki/) to optimize a masked language modelling objective on eight V100 GPUs for around 15 days. It reaches a perplexity of about 3.9 on a dev set sampled from CC-100.
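For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean negative log-likelihood per predicted token. The source does not say exactly how the figure above was computed, so the following is only a sketch of that standard definition. The helper name `perplexity` and the log-probabilities are made up for illustration and do not come from the real model.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over the given tokens.

    `token_log_probs` holds the natural-log probability that a model
    assigned to the correct token at each predicted (masked) position.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities at four masked positions (assumption:
# these numbers are invented purely to show the arithmetic).
example_log_probs = [
    math.log(0.5),
    math.log(0.25),
    math.log(0.125),
    math.log(0.25),
]

print(perplexity(example_log_probs))
```

With these invented values the mean negative log-likelihood is 2·ln 2, so the result is approximately 4. Intuitively, a perplexity near 4 means the model is, on average, about as uncertain as if it were choosing uniformly among four candidate tokens.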

	# Tokenization
	The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer. The vocabulary was trained on Japanese Wikipedia using the official sentencepiece training script.

	# License
	[The MIT license](https://opensource.org/licenses/MIT)