---
language: ja
license: mit
datasets:
- wikipedia
---
# nagisa_bert
A BERT model for [nagisa](https://github.com/taishi-i/nagisa).
The model is available in [Transformers](https://github.com/huggingface/transformers) 🤗.
A tokenizer for nagisa_bert is available [here](https://github.com/taishi-i/nagisa_bert).
## Install
To use this model, the following Python library must be installed.
You can install [*nagisa_bert*](https://github.com/taishi-i/nagisa_bert) with the *pip* command.
Python 3.7+ on Linux or macOS is required.
```bash
pip install nagisa_bert
```
## Usage
This model can be used with the Transformers `pipeline` method.
```python
from transformers import pipeline
from nagisa_bert import NagisaBertTokenizer
text = "nagisaγ§[MASK]γ§γγγ’γγ«γ§γ"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
fill_mask = pipeline("fill-mask", model='taishi-i/nagisa_bert', tokenizer=tokenizer)
print(fill_mask(text))
```
```python
[{'score': 0.1385931372642517,
  'sequence': 'nagisa で 使用 できる モデル です',
  'token': 8092,
  'token_str': '使 用'},
 {'score': 0.11947669088840485,
  'sequence': 'nagisa で 利用 できる モデル です',
  'token': 8252,
  'token_str': '利 用'},
 {'score': 0.04910655692219734,
  'sequence': 'nagisa で 作成 できる モデル です',
  'token': 9559,
  'token_str': '作 成'},
 {'score': 0.03792576864361763,
  'sequence': 'nagisa で 購入 できる モデル です',
  'token': 9430,
  'token_str': '購 入'},
 {'score': 0.026893319562077522,
  'sequence': 'nagisa で 入手 できる モデル です',
  'token': 11273,
  'token_str': '入 手'}]
```
The tokenizer and model can also be used directly for tokenization and vectorization.
```python
from transformers import BertModel
from nagisa_bert import NagisaBertTokenizer
text = "nagisaγ§[MASK]γ§γγγ’γγ«γ§γ"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
tokens = tokenizer.tokenize(text)
print(tokens)
# ['na', '##g', '##is', '##a', 'γ§', '[MASK]', 'γ§γγ', 'γ’γγ«', 'γ§γ']
model = BertModel.from_pretrained("taishi-i/nagisa_bert")
h = model(**tokenizer(text, return_tensors="pt")).last_hidden_state
print(h)
```
```python
tensor([[[-0.2912, -0.6818, -0.4097, ..., 0.0262, -0.3845, 0.5816],
[ 0.2504, 0.2143, 0.5809, ..., -0.5428, 1.1805, 1.8701],
[ 0.1890, -0.5816, -0.5469, ..., -1.2081, -0.2341, 1.0215],
...,
[-0.4360, -0.2546, -0.2824, ..., 0.7420, -0.2904, 0.3070],
[-0.6598, -0.7607, 0.0034, ..., 0.2982, 0.5126, 1.1403],
[-0.2505, -0.6574, -0.0523, ..., 0.9082, 0.5851, 1.2625]]],
grad_fn=<NativeLayerNormBackward0>)
```
## Model description
### Architecture
The model architecture is the same as that of [BERT bert-base-uncased](https://huggingface.co./bert-base-uncased) (12 layers, 768-dimensional hidden states, and 12 attention heads).
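As a quick check of these settings, the configuration can be loaded from the Hub and inspected. This is a minimal sketch, assuming network access to the `taishi-i/nagisa_bert` repository:

```python
from transformers import AutoConfig

# Load the model configuration from the Hugging Face Hub.
config = AutoConfig.from_pretrained("taishi-i/nagisa_bert")

# These should match the BERT-base settings described above.
print(config.num_hidden_layers)    # 12
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
```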
### Training Data
The model is trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 8, 2022 with [make_corpus_wiki.py](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) and [create_pretraining_data.py](https://github.com/cl-tohoku/bert-japanese/blob/main/create_pretraining_data.py).
### Training
The model is trained with the default parameters of [transformers.BertConfig](https://huggingface.co./docs/transformers/model_doc/bert#transformers.BertConfig).
Due to GPU memory limitations, the batch size is kept small: 16 instances per batch, trained for 2M steps.
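For reference, the following is a minimal, hypothetical sketch of such a masked-language-modeling setup with the Transformers `Trainer`. It is not the actual training script; the output directory, masking probability, and dataset preparation are illustrative assumptions, while the batch size and step count follow the description above.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)
from nagisa_bert import NagisaBertTokenizer

tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")

# Default BERT-base configuration; only the vocabulary size follows the tokenizer.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Standard masked-LM collator (15% masking is the BERT default).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Batch size and number of training steps as stated above.
training_args = TrainingArguments(
    output_dir="nagisa_bert_pretraining",  # hypothetical output directory
    per_device_train_batch_size=16,
    max_steps=2_000_000,
)

# `train_dataset` would be the tokenized Wikipedia corpus described above, e.g.:
# from transformers import Trainer
# trainer = Trainer(model=model, args=training_args,
#                   data_collator=data_collator, train_dataset=train_dataset)
# trainer.train()
```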
## Tutorial
Here you can find a list of notebooks on Japanese NLP that use pre-trained models and Transformers.
| Notebook | Description | Colab |
|:----------|:-------------|------:|
| [Fill-mask](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/fill_mask-japanese_bert_models.ipynb) | How to use the pipeline function in transformers to fill in Japanese text. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/fill_mask-japanese_bert_models.ipynb)|
| [Feature-extraction](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/feature_extraction-japanese_bert_models.ipynb) | How to use the pipeline function in transformers to extract features from Japanese text. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/feature_extraction-japanese_bert_models.ipynb)|
| [Embedding visualization](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/embedding_visualization-japanese_bert_models.ipynb) | Show how to visualize embeddings from Japanese pre-trained models. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/embedding_visualization_japanese_bert_models.ipynb)|
| [How to fine-tune a model on text classification](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-amazon_reviews_ja.ipynb) | Show how to fine-tune a pretrained model on a Japanese text classification task. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-amazon_reviews_ja.ipynb)|
| [How to fine-tune a model on text classification with csv files](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-csv_files.ipynb) | Show how to preprocess the data and fine-tune a pretrained model on a Japanese text classification task. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-csv_files.ipynb)|
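As a minimal, local counterpart to the fine-tuning notebooks above, the sketch below fine-tunes the model for text classification. The dataset, label count, sequence length, and hyperparameters are illustrative assumptions rather than the notebooks' actual settings.

```python
from transformers import BertForSequenceClassification, TrainingArguments
from nagisa_bert import NagisaBertTokenizer

tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")

# Two-class classification is assumed here for illustration.
model = BertForSequenceClassification.from_pretrained("taishi-i/nagisa_bert", num_labels=2)

def tokenize(batch):
    # Truncate/pad to a fixed length; 128 tokens is an illustrative choice.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

training_args = TrainingArguments(
    output_dir="nagisa_bert_classification",  # hypothetical output directory
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

# `train_dataset` and `eval_dataset` stand in for a tokenized Japanese text
# classification dataset, prepared for example as in the notebooks above:
# from transformers import Trainer
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```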