---
language: ja
license: mit
datasets:
- wikipedia
---

# nagisa_bert

A BERT model for [nagisa](https://github.com/taishi-i/nagisa).
The model is available in [Transformers](https://github.com/huggingface/transformers) πŸ€—.

A tokenizer for nagisa_bert is available [here](https://github.com/taishi-i/nagisa_bert).

## Install

To use this model, the [*nagisa_bert*](https://github.com/taishi-i/nagisa_bert) Python library must be installed.
You can install it with the *pip* command.

Python 3.7+ on Linux or macOS is required.


```bash
pip install nagisa_bert
```

## Usage

This model can be used with the *pipeline* function in Transformers.

```python
from transformers import pipeline
from nagisa_bert import NagisaBertTokenizer

text = "nagisaで[MASK]できるヒデルです"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
fill_mask = pipeline("fill-mask", model='taishi-i/nagisa_bert', tokenizer=tokenizer)
print(fill_mask(text))
```

```python
[{'score': 0.1385931372642517,
  'sequence': 'nagisa で 使用 できる ヒデル です',
  'token': 8092,
  'token_str': 'δ½Ώ 用'},
 {'score': 0.11947669088840485,
  'sequence': 'nagisa で εˆ©η”¨ できる ヒデル です',
  'token': 8252,
  'token_str': '利 用'},
 {'score': 0.04910655692219734,
  'sequence': 'nagisa で 作成 できる ヒデル です',
  'token': 9559,
  'token_str': '作 成'},
 {'score': 0.03792576864361763,
  'sequence': 'nagisa で θ³Όε…₯ できる ヒデル です',
  'token': 9430,
  'token_str': 'θ³Ό ε…₯'},
 {'score': 0.026893319562077522,
  'sequence': 'nagisa で ε…₯手 できる ヒデル です',
  'token': 11273,
  'token_str': 'ε…₯ 手'}]
```
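
The pipeline returns the candidates sorted by score, highest first. For example, to keep only the top prediction (a minimal sketch reusing `fill_mask` and `text` from the example above):

```python
# Take the highest-scoring candidate returned by the pipeline.
predictions = fill_mask(text)
best = predictions[0]
print(best["sequence"])  # 'nagisa で 使用 できる ヒデル です'
print(best["score"])
```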

Tokenization and vectorization with `NagisaBertTokenizer` and `BertModel`:

```python
from transformers import BertModel
from nagisa_bert import NagisaBertTokenizer

text = "nagisaで[MASK]できるヒデルです"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
tokens = tokenizer.tokenize(text)
print(tokens)
# ['na', '##g', '##is', '##a', 'で', '[MASK]', 'できる', 'ヒデル', 'です']

model = BertModel.from_pretrained("taishi-i/nagisa_bert")
h = model(**tokenizer(text, return_tensors="pt")).last_hidden_state
print(h)
```

```python
tensor([[[-0.2912, -0.6818, -0.4097,  ...,  0.0262, -0.3845,  0.5816],
         [ 0.2504,  0.2143,  0.5809,  ..., -0.5428,  1.1805,  1.8701],
         [ 0.1890, -0.5816, -0.5469,  ..., -1.2081, -0.2341,  1.0215],
         ...,
         [-0.4360, -0.2546, -0.2824,  ...,  0.7420, -0.2904,  0.3070],
         [-0.6598, -0.7607,  0.0034,  ...,  0.2982,  0.5126,  1.1403],
         [-0.2505, -0.6574, -0.0523,  ...,  0.9082,  0.5851,  1.2625]]],
       grad_fn=<NativeLayerNormBackward0>)
```
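
If you need a single fixed-length vector for the whole sentence, one common recipe is to pool the token vectors in `last_hidden_state`. The snippet below is a minimal sketch (not part of the original examples) that mean-pools over non-padding tokens, reusing the `tokenizer` and `model` loaded above.

```python
import torch

# Encode the text and run the model without tracking gradients.
encoded = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state         # (1, seq_len, 768)

# Mean-pool over real tokens only, using the attention mask.
mask = encoded["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
sentence_vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```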

## Model description

### Architecture

The model architecture is the same as that of [BERT bert-base-uncased](https://huggingface.co./bert-base-uncased): 12 layers, a hidden size of 768, and 12 attention heads.
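
As a quick check, these numbers can be read from the published configuration; the sketch below assumes only the standard `transformers.BertConfig` attribute names.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("taishi-i/nagisa_bert")
print(config.num_hidden_layers)    # 12
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
```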

### Training Data

The model is trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 8, 2022, using [make_corpus_wiki.py](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) and [create_pretraining_data.py](https://github.com/cl-tohoku/bert-japanese/blob/main/create_pretraining_data.py).

### Training

The model is trained with the default parameters of [transformers.BertConfig](https://huggingface.co./docs/transformers/model_doc/bert#transformers.BertConfig).
Due to GPU memory limitations, the batch size is kept small: 16 instances per batch, trained for 2M steps.
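
The original pre-training script is not included here, but a hypothetical sketch of how these settings map onto the Transformers API could look as follows; only the batch size and step count come from the description above, and `output_dir` and the masked-LM model class are assumptions.

```python
from transformers import BertConfig, BertForMaskedLM, TrainingArguments

# Default BertConfig corresponds to BERT-base: 12 layers, hidden size 768, 12 heads.
config = BertConfig()
model = BertForMaskedLM(config)

# Hypothetical training arguments reflecting the description above.
training_args = TrainingArguments(
    output_dir="nagisa_bert_pretraining",  # assumed path, not from the original
    per_device_train_batch_size=16,        # small batch due to GPU memory limits
    max_steps=2_000_000,                   # 2M training steps
)
```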

## Tutorial

Here is a list of notebooks on Japanese NLP using pre-trained models and Transformers.

| Notebook     |      Description      |  Colab |
|:----------|:-------------|------:|
| [Fill-mask](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/fill_mask-japanese_bert_models.ipynb)  | How to use the pipeline function in transformers to fill masked tokens in Japanese text. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/fill_mask-japanese_bert_models.ipynb)|
| [Feature-extraction](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/feature_extraction-japanese_bert_models.ipynb)  | How to use the pipeline function in transformers to extract features from Japanese text. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/feature_extraction-japanese_bert_models.ipynb)|
| [Embedding visualization](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/embedding_visualization-japanese_bert_models.ipynb)  | How to visualize embeddings from Japanese pre-trained models. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/embedding_visualization_japanese_bert_models.ipynb)|
| [How to fine-tune a model on text classification](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-amazon_reviews_ja.ipynb)  | How to fine-tune a pretrained model on a Japanese text classification task. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-amazon_reviews_ja.ipynb)|
| [How to fine-tune a model on text classification with csv files](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-csv_files.ipynb)  | How to preprocess the data and fine-tune a pretrained model on a Japanese text classification task. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-csv_files.ipynb)|