---
library_name: transformers
base_model: None
tags:
- generated_from_trainer
model-index:
- name: trial2
  results: []
license: apache-2.0
---

## mistral-2b-base

Welcome to my model card!

The main features of this model are:

- Trained on Japanese text
- Trained in two stages: patch level and token level
- Suppresses unknown-word generation by enabling byte fallback in the SentencePiece tokenizer and converting it to the Hugging Face Tokenizers format (see the sketch below)
- Uses a 2B-parameter Mistral architecture

Yukkuri shite ittene! (Take it easy!)
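
The byte-fallback tokenizer setup mentioned above can be reproduced roughly as follows. This is a minimal sketch under assumptions: the corpus path, vocabulary size, and output directory are placeholders, and the conversion to the Tokenizers format goes through `LlamaTokenizerFast` (which accepts a SentencePiece model file); the actual recipe may differ.

```python
import sentencepiece as spm
from transformers import LlamaTokenizerFast

# Train a SentencePiece model with byte fallback so that characters outside the
# vocabulary are decomposed into bytes instead of being mapped to <unk>.
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder: path to the training text
    model_prefix="tokenizer",    # writes tokenizer.model / tokenizer.vocab
    vocab_size=32000,            # placeholder: the real vocab size is not stated here
    model_type="unigram",
    byte_fallback=True,          # the option that suppresses unknown-word generation
    character_coverage=0.9995,
)

# Wrap the SentencePiece model as a Hugging Face fast tokenizer and save it in
# the Tokenizers format.
hf_tokenizer = LlamaTokenizerFast(vocab_file="tokenizer.model")
hf_tokenizer.save_pretrained("tokenizer-hf")
```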

<!-- ## Intended uses & limitations

More information needed
 -->

## How to use the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "ce-lery/mistral-2b-base"
torch.set_float32_matmul_precision('high')

# Fall back to CPU automatically when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
).to(device)

prompt = "自然言語処理とは、"  # "Natural language processing is ..."
inputs = tokenizer(prompt,
                   add_special_tokens=True,
                   return_tensors="pt").to(model.device)

# Beam-sampling generation; adjust the decoding parameters to taste.
with torch.no_grad():
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=4096,
        do_sample=True,
        early_stopping=False,
        top_p=0.95,
        top_k=50,
        temperature=0.7,
        no_repeat_ngram_size=2,
        num_beams=3,
    )

print(outputs.tolist()[0])           # generated token ids
print(tokenizer.decode(outputs[0]))  # decoded text
```

## Training and evaluation data

40B tokens in total, drawn from the following sources (a data-loading sketch follows the list):

- Wikipedia
- Wikibooks
- Wikiversity
- CC-100
- OSCAR2109
- mC4 (first 150 GB)
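
For illustration only, a corpus built from sources like these could be assembled with the `datasets` library roughly as below. The file paths and the assumption that each source is already prepared as JSONL with a `text` field are mine, not taken from the recipe; the repository linked in the next section describes the actual preprocessing.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder paths: each source corpus is assumed to already be cleaned and
# stored as JSONL files with a "text" field (the real recipe may differ).
sources = [
    "data/wikipedia_ja.jsonl",
    "data/wikibooks_ja.jsonl",
    "data/wikiversity_ja.jsonl",
    "data/cc100_ja.jsonl",
    "data/oscar2109_ja.jsonl",
    "data/mc4_ja_head150gb.jsonl",
]

# Stream each source so nothing has to fit in memory.
streams = [
    load_dataset("json", data_files=path, split="train", streaming=True)
    for path in sources
]

# Interleave the sources into a single text stream for pretraining.
corpus = interleave_datasets(streams)

for example in corpus.take(3):
    print(example["text"][:80])
```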

## Training procedure

Please refer to [ce-lery/mistral-2b-recipe](https://github.com/ce-lery/mistral-2b-recipe).  
A guide for the repository is published [here](); it is written in Japanese.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 128
- total_train_batch_size: 256
- optimizer: adamw_bnb_8bit with betas=(0.9, 0.95), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine_with_min_lr
- lr_scheduler_warmup_steps: 1000
- num_epochs: 1.0
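
Expressed as `transformers.TrainingArguments`, the settings above correspond roughly to the sketch below. This is a reconstruction, not the original training script: `output_dir` and the `min_lr` value for the cosine-with-min-lr scheduler are placeholders, since neither is stated in this card.

```python
from transformers import TrainingArguments

# Reconstruction of the hyperparameters listed above; not the original script.
training_args = TrainingArguments(
    output_dir="trial2",                   # placeholder output directory
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=128,       # 2 * 128 = 256 total train batch size
    seed=42,
    optim="adamw_bnb_8bit",
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr": 2e-5},  # placeholder: min_lr is not given in this card
    warmup_steps=1000,
    num_train_epochs=1.0,
)
```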

### Training results

Please refer to the TensorBoard logs [here](https://huggingface.co./ce-lery/mistral-2b-base/tensorboard).

### Framework versions

- Transformers 4.46.2
- Pytorch 2.4.0a0+f70bd71a48.nv24.06
- Datasets 2.20.0
- Tokenizers 0.20.3