---
base_model: augmxnt/shisa-7b-v1
datasets:
- augmxnt/ultra-orca-boros-en-ja-v1
- Open-Orca/SlimOrca
- augmxnt/shisa-en-ja-dpo-v1
inference: false
language:
- ja
- en
license: apache-2.0
model_creator: AUGMXNT
model_name: Shisa 7B v1
model_type: mistral
prompt_template: '[INST] <<SYS>>

  {system_message}

  <</SYS>>

  {prompt} [/INST]

  '
quantized_by: TheBloke
---
<!-- markdownlint-disable MD041 -->

<!-- header start -->
<!-- 200823 -->
<div style="width: auto; margin-left: auto; margin-right: auto">
<img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
</div>
<div style="display: flex; justify-content: space-between; width: 100%;">
    <div style="display: flex; flex-direction: column; align-items: flex-start;">
        <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://discord.gg/theblokeai">Chat & support: TheBloke's Discord server</a></p>
    </div>
    <div style="display: flex; flex-direction: column; align-items: flex-end;">
        <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
    </div>
</div>
<div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">TheBloke's LLM work is generously supported by a grant from <a href="https://a16z.com">andreessen horowitz (a16z)</a></p></div>
<hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
<!-- header end -->

# Shisa 7B v1 - GPTQ
- Model creator: [AUGMXNT](https://huggingface.co./augmxnt)
- Original model: [Shisa 7B v1](https://huggingface.co./augmxnt/shisa-7b-v1)

<!-- description start -->
# Description

This repo contains GPTQ model files for [AUGMXNT's Shisa 7B v1](https://huggingface.co./augmxnt/shisa-7b-v1).

Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

These files were quantised using hardware kindly provided by [Massed Compute](https://massedcompute.com/).

<!-- description end -->
<!-- repositories-available start -->
## Repositories available

* [AWQ model(s) for GPU inference.](https://huggingface.co./TheBloke/shisa-7B-v1-AWQ)
* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co./TheBloke/shisa-7B-v1-GPTQ)
* [AUGMXNT's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co./augmxnt/shisa-7b-v1)
<!-- repositories-available end -->

<!-- prompt-template start -->
## Prompt template: Llama-2-Chat

```
[INST] <<SYS>>
{system_message}
<</SYS>>
{prompt} [/INST]

```

<!-- prompt-template end -->



<!-- README_GPTQ.md-compatible clients start -->
## Known compatible clients / servers

GPTQ models are currently supported on Linux (NVidia/AMD) and Windows (NVidia only). macOS users: please use GGUF models.

These GPTQ models are known to work in the following inference servers/webuis.

- [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
- [KoboldAI United](https://github.com/henk717/koboldai)
- [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui)
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)

This may not be a complete list; if you know of others, please let me know!
<!-- README_GPTQ.md-compatible clients end -->

<!-- README_GPTQ.md-provided-files start -->
## Provided files, and GPTQ parameters

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.

Each separate quant is in a different branch.  See below for instructions on fetching from different branches.

Most GPTQ files are made with AutoGPTQ. Mistral models are currently made with Transformers.

<details>
  <summary>Explanation of GPTQ parameters</summary>

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The calibration dataset used during quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ calibration dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama and Mistral models in 4-bit.

</details>

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| [main](https://huggingface.co./TheBloke/shisa-7B-v1-GPTQ/tree/main) | 4 | 128 | Yes | 0.1 | [Shisa English Japanese DPO](https://huggingface.co./datasets/augmxnt/shisa-en-ja-dpo-v1/viewer/) | 4096 | 5.60 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. | 
| [gptq-4bit-32g-actorder_True](https://huggingface.co./TheBloke/shisa-7B-v1-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | [Shisa English Japanese DPO](https://huggingface.co./datasets/augmxnt/shisa-en-ja-dpo-v1/viewer/) | 4096 | 6.01 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. | 
| [gptq-8bit--1g-actorder_True](https://huggingface.co./TheBloke/shisa-7B-v1-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | [Shisa English Japanese DPO](https://huggingface.co./datasets/augmxnt/shisa-en-ja-dpo-v1/viewer/) | 4096 | 8.96 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. | 
| [gptq-8bit-128g-actorder_True](https://huggingface.co./TheBloke/shisa-7B-v1-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | [Shisa English Japanese DPO](https://huggingface.co./datasets/augmxnt/shisa-en-ja-dpo-v1/viewer/) | 4096 | 9.12 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. | 
| [gptq-8bit-32g-actorder_True](https://huggingface.co./TheBloke/shisa-7B-v1-GPTQ/tree/gptq-8bit-32g-actorder_True) | 8 | 32 | Yes | 0.1 | [Shisa English Japanese DPO](https://huggingface.co./datasets/augmxnt/shisa-en-ja-dpo-v1/viewer/) | 4096 | 9.61 GB | No | 8-bit, with group size 32g and Act Order for maximum inference quality. | 
| [gptq-4bit-64g-actorder_True](https://huggingface.co./TheBloke/shisa-7B-v1-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | [Shisa English Japanese DPO](https://huggingface.co./datasets/augmxnt/shisa-en-ja-dpo-v1/viewer/) | 4096 | 5.74 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
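
The parameters in the table above map directly onto the quantisation configuration used when a GPTQ model is produced. As a point of reference only, here is a minimal, illustrative sketch of how the `main` branch settings would be expressed with Transformers' `GPTQConfig` (this is not the exact script used to make these files, and the calibration texts shown are placeholders rather than the real Shisa EN/JA DPO data):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "augmxnt/shisa-7b-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder calibration texts; a real run would use samples from the calibration dataset
calibration_texts = ["[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\nTell me about AI [/INST]"]

gptq_config = GPTQConfig(
    bits=4,                     # "Bits" column
    group_size=128,             # "GS" column
    desc_act=True,              # "Act Order"
    damp_percent=0.1,           # "Damp %"
    dataset=calibration_texts,  # calibration data
    tokenizer=tokenizer,
    model_seqlen=4096,          # "Seq Len"
)

# Loading with quantization_config quantises the fp16 weights on the fly
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
model.save_pretrained("shisa-7B-v1-GPTQ-4bit-128g")
```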

<!-- README_GPTQ.md-provided-files end -->

<!-- README_GPTQ.md-download-from-branches start -->
## How to download, including from branches

### In text-generation-webui

To download from the `main` branch, enter `TheBloke/shisa-7B-v1-GPTQ` in the "Download model" box.

To download from another branch, add `:branchname` to the end of the download name, eg `TheBloke/shisa-7B-v1-GPTQ:gptq-4bit-32g-actorder_True`

### From the command line

I recommend using the `huggingface-hub` Python library:

```shell
pip3 install huggingface-hub
```

To download the `main` branch to a folder called `shisa-7B-v1-GPTQ`:

```shell
mkdir shisa-7B-v1-GPTQ
huggingface-cli download TheBloke/shisa-7B-v1-GPTQ --local-dir shisa-7B-v1-GPTQ --local-dir-use-symlinks False
```

To download from a different branch, add the `--revision` parameter:

```shell
mkdir shisa-7B-v1-GPTQ
huggingface-cli download TheBloke/shisa-7B-v1-GPTQ --revision gptq-4bit-32g-actorder_True --local-dir shisa-7B-v1-GPTQ --local-dir-use-symlinks False
```

<details>
  <summary>More advanced huggingface-cli download usage</summary>

If you remove the `--local-dir-use-symlinks False` parameter, the files will instead be stored in the central Hugging Face cache directory (default location on Linux is: `~/.cache/huggingface`), and symlinks will be added to the specified `--local-dir`, pointing to their real location in the cache. This allows for interrupted downloads to be resumed, and allows you to quickly clone the repo to multiple places on disk without triggering a download again. The downside, and the reason why I don't list that as the default option, is that the files are then hidden away in a cache folder, making it harder to know where your disk space is being used and to clear it up if/when you want to remove a downloaded model.

The cache location can be changed with the `HF_HOME` environment variable, and/or the `--cache-dir` parameter to `huggingface-cli`.

For more documentation on downloading with `huggingface-cli`, please see: [HF -> Hub Python Library -> Download files -> Download from the CLI](https://huggingface.co./docs/huggingface_hub/guides/download#download-from-the-cli).
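
If you prefer to script the download instead of calling the CLI, the same thing can be done from Python with `huggingface_hub.snapshot_download`; a minimal sketch (omit `revision` to download the `main` branch):

```python
from huggingface_hub import snapshot_download

# Download the chosen branch into ./shisa-7B-v1-GPTQ
snapshot_download(
    repo_id="TheBloke/shisa-7B-v1-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # optional; defaults to main
    local_dir="shisa-7B-v1-GPTQ",
)
```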

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`:

```shell
pip3 install hf_transfer
```

And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`:

```shell
mkdir shisa-7B-v1-GPTQ
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download TheBloke/shisa-7B-v1-GPTQ --local-dir shisa-7B-v1-GPTQ --local-dir-use-symlinks False
```

Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.
</details>

### With `git` (**not** recommended)

To clone a specific branch with `git`, use a command like this:

```shell
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co./TheBloke/shisa-7B-v1-GPTQ
```

Note that using Git with HF repos is strongly discouraged. It will be much slower than using `huggingface-hub`, and will use twice as much disk space as it has to store the model files twice (it stores every byte both in the intended target folder, and again in the `.git` folder as a blob.)

<!-- README_GPTQ.md-download-from-branches end -->
<!-- README_GPTQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/shisa-7B-v1-GPTQ`.

    - To download from a specific branch, enter for example `TheBloke/shisa-7B-v1-GPTQ:gptq-4bit-32g-actorder_True`
    - see Provided Files above for the list of branches for each option.

3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to **Model**.
6. In the **Model** dropdown, choose the model you just downloaded: `shisa-7B-v1-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.

    - Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.

9. Once you're ready, click the **Text Generation** tab and enter a prompt to get started!

<!-- README_GPTQ.md-text-generation-webui end -->

<!-- README_GPTQ.md-use-from-tgi start -->
## Serving this model from Text Generation Inference (TGI)

It's recommended to use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`

Example Docker parameters:

```shell
--model-id TheBloke/shisa-7B-v1-GPTQ --port 3000 --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
```

Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):

```shell
pip3 install huggingface-hub
```

```python
from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
system_message = "You are a helpful assistant."  # example system prompt
prompt_template = f'''[INST] <<SYS>>
{system_message}
<</SYS>>
{prompt} [/INST]
'''

client = InferenceClient(endpoint_url)
response = client.text_generation(prompt,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print(f"Model output: {response}")
```
<!-- README_GPTQ.md-use-from-tgi end -->
<!-- README_GPTQ.md-use-from-python start -->
## Python code example: inference from this GPTQ model

### Install the necessary packages

Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

```shell
pip3 install --upgrade transformers optimum
# If using PyTorch 2.1 + CUDA 12.x:
pip3 install --upgrade auto-gptq
# or, if using PyTorch 2.1 + CUDA 11.x:
pip3 install --upgrade auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
```

If you are using PyTorch 2.0, you will need to install AutoGPTQ from source. Likewise if you have problems with the pre-built wheels, you should try building from source:

```shell
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.5.1
pip3 install .
```

### Example Python code

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/shisa-7B-v1-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
system_message = "You are a helpful assistant."  # example system prompt
prompt_template = f'''[INST] <<SYS>>
{system_message}
<</SYS>>
{prompt} [/INST]
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])
```
<!-- README_GPTQ.md-use-from-python end -->

<!-- README_GPTQ.md-compatibility start -->
## Compatibility

The files provided are tested to work with Transformers. For non-Mistral models, AutoGPTQ can also be used directly.

[ExLlama](https://github.com/turboderp/exllama) is compatible with Llama and Mistral models in 4-bit. Please see the Provided Files table above for per-file compatibility.

For a list of clients/servers, please see "Known compatible clients / servers", above.
<!-- README_GPTQ.md-compatibility end -->

<!-- footer start -->
<!-- 200823 -->
## Discord

For further support, and discussions on these models and AI in general, join us at:

[TheBloke AI's Discord server](https://discord.gg/theblokeai)

## Thanks, and how to contribute

Thanks to the [chirper.ai](https://chirper.ai) team!

Thanks to Clay from [gpus.llm-utils.org](https://gpus.llm-utils.org)!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

* Patreon: https://patreon.com/TheBlokeAI
* Ko-Fi: https://ko-fi.com/TheBlokeAI

**Special thanks to**: Aemon Algiz.

**Patreon special mentions**: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, S_X, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros


Thank you to all my generous patrons and donaters!

And thank you again to a16z for their generous grant.

<!-- footer end -->

# Original model card: AUGMXNT's Shisa 7B v1

# Shisa 7B

![Shi-chan and Sa-chan/シーちゃんとサーちゃん](https://huggingface.co./augmxnt/shisa-7b-v1/resolve/main/shisa.webp)

**Shisa 7B** (`shisa-7b-v1`) is a bilingual Japanese and English (JA/EN) general-purpose chat model that aims to achieve strong Japanese language performance while retaining robust English capabilities, using a synthetic-data driven approach.

This model is based on [Mistral 7B](https://huggingface.co./mistralai/Mistral-7B-v0.1) with a custom JA-optimized extended tokenizer that is >2X more efficient in Japanese than Mistral's original tokenizer. The base model was pre-trained on an additional 8B primarily-Japanese tokens. It was then fine-tuned with an expanded, machine-translated version of [airoboros-3.1](https://huggingface.co./datasets/jondurbin/airoboros-3.1), a set of the highest-scoring items from [ultrafeedback_binarized](https://huggingface.co./datasets/HuggingFaceH4/ultrafeedback_binarized), and additional [airoboros](https://github.com/jondurbin/airoboros) data generated directly in the target languages.
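
As a quick, informal way to see the tokenizer efficiency difference yourself, you can compare token counts for the same Japanese sentence (a rough sketch, assuming you have access to both tokenizers on the Hub; the example sentence is arbitrary):

```python
from transformers import AutoTokenizer

ja_text = "最強のポケモンは誰ですか?その選択理由を説明してください。"

shisa_tok = AutoTokenizer.from_pretrained("augmxnt/shisa-7b-v1")
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Fewer tokens for the same text means a more efficient encoding
print("shisa:  ", len(shisa_tok(ja_text).input_ids))
print("mistral:", len(mistral_tok(ja_text).input_ids))
```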

We also release our base model, datasets, and pipeline code under a permissive Apache 2.0 license which can be used for any purpose, commercial or otherwise:
* [shisa-base-7b-v1](https://huggingface.co./augmxnt/shisa-base-7b-v1) - our base model w/ an extended tokenizer and additional JA pre-training
* [shisa-pretrain-en-ja-v1](https://huggingface.co./datasets/augmxnt/shisa-pretrain-en-ja-v1) - our pre-training data set
* [ultra-orca-boros-en-ja](https://huggingface.co./datasets/augmxnt/ultra-orca-boros-en-ja-v1) - a synthetically generated, machine-translated, programmatically validated JA/EN fine-tuning dataset
* [shisa-en-ja-dpo-v1](https://huggingface.co./datasets/augmxnt/shisa-en-ja-dpo-v1) - Small subset of DPO pairs from ultrafeedback, along with JA DPO pairs using GPT-4 generated items as the chosen value, and outputs from our preliminary 7b model as the rejected values
* [Shisa repository](https://github.com/AUGMXNT/shisa) - this includes our translation, dataset generation, training, and evaluation code

Moreover, we are in the process of publishing extended writeups and more details of our process, including ablation results, testing methodology, and key findings [on our project wiki](https://github.com/AUGMXNT/shisa/wiki) that may be of interest to fellow researchers.

## Fine-Tuning
Our original intuition was to see if we could create a stronger Japanese model using the best [existing public JA training sets](https://github.com/AUGMXNT/shisa/wiki/A-Review-of-Public-Japanese-Training-Sets) and incorporating them. After initial review and testing, however, we decided that focusing solely on translation/generation of our own synthetic datasets could yield superior results with less training.

We compared multiple translation tools and, via manual review, judged that while `gpt-4` almost always delivered the highest quality translations, Google's `text-bison-32k` was a good balance of quality, cost and throughput. Over various iterations, we refined our translation approach to include some additional algorithms for flagging and filtering invalid translations, re-translating and backfilling as necessary.

We also took this project as an opportunity to apply some newer techniques such as incorporating [NEFTune](https://arxiv.org/abs/2310.05914) and [DPO](https://arxiv.org/abs/2305.18290) training.

For our v1 release, we picked from our release candidates based on a significant amount of human preference testing (thousands of generations and multiple rounds of pairwise comparisons). We analyzed our results with both win/loss/draw and [BTL modeling](https://datascience.oneoffcoder.com/btl-model.html) (iLSR) using [choix](https://github.com/lucasmaystre/choix).
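
As a rough illustration of that pairwise analysis (not our actual evaluation code), BTL/iLSR strengths can be estimated from win records with `choix`; the comparison data below is made up purely for demonstration:

```python
import choix

# Hypothetical pairwise outcomes over 3 release candidates: (winner_index, loser_index)
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)]

# iLSR estimate of Bradley-Terry-Luce strengths; higher means more preferred
strengths = choix.ilsr_pairwise(3, comparisons, alpha=0.01)
print(strengths)
```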


The best candidate model was fine-tuned in a 3-step process:

1. First, the model was fine-tuned on `ultra-orca-boros-en-ja` and SlimOrca ([WandB Log](https://wandb.ai/jondurbin/shisa-7b-v1/runs/k8pfog9d/overview)) 
2. Next, one additional epoch was performed using only a subset of the Japanese ultra-orca-boros-en-ja items, to enhance JA performance (since SlimOrca from the first step is mostly EN) ([WandB Log](https://wandb.ai/jondurbin/shisa-mega-7b-v1.1/runs/dopsr0o7/overview))
3. Finally, the model was tuned using a DPOTrainer on a small subset of ultrafeedback (EN) and our own JA DPO dataset, which uses gpt-4 outputs as the chosen values and outputs from stage 1's preliminary model as the rejected values ([WandB Log](https://wandb.ai/jondurbin/shisa-mega-dpo-7b-v1.1)); an illustrative sketch of this stage follows below.
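
For readers unfamiliar with this last step, here is a heavily simplified, purely illustrative sketch of DPO training with `trl`'s `DPOTrainer`. It is not the actual training script: the hyperparameters are placeholders, the dataset split and column handling are assumed, and the argument names follow the 2023-era trl API, which may differ in newer releases.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# Stand-in for the stage-2 checkpoint; assumed for illustration only
model_name = "augmxnt/shisa-7b-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# DPO data with prompt / chosen / rejected columns (split name assumed)
dataset = load_dataset("augmxnt/shisa-en-ja-dpo-v1", split="train")

training_args = TrainingArguments(
    output_dir="shisa-dpo",
    per_device_train_batch_size=1,
    remove_unused_columns=False,  # keep the chosen/rejected columns for the trainer
)

trainer = DPOTrainer(
    model,
    ref_model=None,   # trl builds a frozen reference copy when None
    beta=0.1,         # placeholder DPO temperature, not the value used for this model
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```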

During our training process, we also gained some key insights on [why some existing Japanese models seem to underperform](https://github.com/AUGMXNT/shisa/wiki/A-Review-of-Public-Japanese-Training-Sets#analysis) even versus models that have no additional JA training, and we hope that sharing this analysis will be useful to other teams developing Japanese language models.

While we need to explore this further, as an experimental validation, we applied a version of our fine-tuning set to an existing base model ("Gamma 7B"), and the initial JA MT-Bench results suggest that we can drastically increase functional performance with our tuning approach:

| Model                          | Score |
| ------------------------------ | ----- |
| shisa-gamma-7b-allsources-v0.4 |  5.65 |
| ja-stablelm-instruct-gamma-7b* |  4.01 |


## Performance
Throughout our training, we did extensive human evaluation for each model to cross-validate our model performance, and we are currently conducting ongoing larger-scale manual head-to-head testing between models. Our intention is to open up and scale this data collection as we further develop our tools. For more information and updates, please see our [project wiki](https://github.com/AUGMXNT/shisa/wiki).

While we believe [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) is a useful metric for our [base model](https://huggingface.co./augmxnt/shisa-base-7b-v1), and it was extremely useful for initial validations during our tuning process, our fine-tune training includes a percentage of the benchmark train splits, so we provide these llm-jp-eval results primarily as a point of interest:

| AVR   | MC    | NLI   | QA    | RC    |
|-------|-------|-------|-------|-------|
| 0.7480| 0.8900| 0.8040| 0.4153| 0.8825|

*(We run a [slightly modified llm-jp-eval](https://github.com/llm-jp/llm-jp-eval/compare/main...AUGMXNT:llm-jp-eval:main) to support testing of Qwen and to emit a `bos_token` if available)*

For our final model, since it's customary to include benchmarks, we've used Stability AI Japan's [Japanese MT-Bench](https://github.com/Stability-AI/FastChat) as a more representative test of our model's capabilities. For [our JA MT-Bench testing](https://github.com/Stability-AI/FastChat/compare/jp-stable...AUGMXNT:FastChat:jp-stable) we use a Japanese system prompt ("あなたは役立つアシスタントです。") as well as `--num-choices 4` in an effort to reduce sampling variability. However, we've still observed regular 0.5+ point swings (and sometimes even greater) between generations, as well as issues with default prompts and parameters when testing, so we'd urge caution against over-interpreting these scores: treat them as a probabilistic directional indicator rather than a definitive score or ranking:

| Benchmark   | Score |
| ----------- | ----- |
| JA MT-Bench |  5.02 |
| MT-Bench    |  5.71 |

There is an [MT-Bench Leaderboard](https://huggingface.co./spaces/lmsys/chatbot-arena-leaderboard), but as JA MT-Bench is still under development, for convenience, here is a comparison of the JA MT-Bench scores of some other models (our scores were rated by `gpt-4-0613`):

| Model                                             | Score |
| ------------------------------------------------- | ---- |
| gpt-4-0613                                        | 9.40 |
| gpt-4-1106-preview                                | 9.17 |
| gpt-3.5-turbo*                                    | 8.41 |
| Qwen-14B-Chat                                     | 7.47 |
| **shisa-7b-v1**                              | **5.02** |
| ELYZA-japanese-Llama-2-7b-fast-instruct*          | 4.86 |
| ja-stablelm-instruct-gamma-7b*                    | 4.01 |
| japanese-stablelm-instruct-alpha-7b*              | 2.74 |
| Mistral-7B-OpenOrca-ja*                           | 2.23 |
| youri-7b-chat*                                    | 2.00 |
| Mistral-7B-Instruct-v0.1*                         | 1.78 |
| llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0* | 1.31 |

*(Marked JA MT-Bench results in this section are [sourced from shi3z](https://note.com/shi3zblog/n/n6b2ac5874021))*

## Limitations
Although our model demonstrates a reasonably high level of Japanese fluency, as a 7B parameter model it is prone to higher hallucination rates and less effective instruction following and reasoning than larger-class models. It also does not yet have complete mastery of the Japanese language, and a native speaker will spot occasional mistakes such as non-idiomatic/awkward phrasing, improper tenses/speech levels, etc.

We've also noticed a small amount of language leakage, likely largely attributable to our tokenizer expansion. This may be fixable with sampler settings like [Min P](https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/) or additional targeted training, and we plan on doing additional work on automated detection/sampler sweeps in the future. One interesting observation from our data collection is that, as we iterated, the DPO process significantly exacerbated this issue; however, our DPO models still had significantly higher human preference rates, so there was a bit of a trade-off in our choice of final tune.

While we believe that training larger models can improve performance using our existing approach and dataset, there are also many improvements we'd like to make for future models. We believe there is quite a bit of low hanging fruit for improving performance with even more training efficiency largely through improving the quality and construction of datasets.

## Usage
Sample code:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_name = "augmxnt/shisa-7b-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    device_map="auto"
)
streamer = TextStreamer(tokenizer, skip_prompt=True)

# The prompt template is included in the model's tokenizer_config.json, so you shouldn't need this, but we've included it here for convenience
# tokenizer.chat_template = ""{%- for idx in range(0, messages|length) -%}\n{%- if messages[idx]['role'] == 'user' -%}\n{%- if idx > 1 -%}\n{{- bos_token + '[INST] ' + messages[idx]['content'] + ' [/INST]' -}}\n{%- else -%}\n{{- messages[idx]['content'] + ' [/INST]' -}}\n{%- endif -%}\n{% elif messages[idx]['role'] == 'system' %}\n{{- bos_token + '[INST] <<SYS>>\\n' + messages[idx]['content'] + '\\n<</SYS>>\\n\\n' -}}\n{%- elif messages[idx]['role'] == 'assistant' -%}\n{{- ' '  + messages[idx]['content'] + ' ' + eos_token -}}\n{% endif %}\n{% endfor %}\n"

# A more typical prompt: あなたは役に立つアシスタントです。("You are a helpful assistant.")

# You are an avid Pokemon fanatic.
prompt = "あなたは熱狂的なポケモンファンです。"
chat = [{"role": "system", "content": prompt}]

# Who is the most powerful Pokemon? Explain your choice.
user_input = "最強のポケモンは誰ですか?その選択理由を説明してください。"
chat.append({"role": "user", "content": user_input})

# Generate - add_generation_prompt to make sure it continues as assistant
inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
# For multi-GPU, find the device of the first parameter of the model
first_param_device = next(model.parameters()).device
inputs = inputs.to(first_param_device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=1000,
        temperature=0.7,
        repetition_penalty=1.05,
        top_p=0.95,
        do_sample=True,
        streamer=streamer,
    )

# Add just the new tokens to our chat
new_tokens = outputs[0, inputs.size(1):]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
chat.append({"role": "assistant", "content": response})
```

## Prompt format
The prompt format is llama-2 chat:

```
[INST] <<SYS>>
You are a helpful, unbiased, uncensored assistant.
<</SYS>>
{prompt} [/INST]
```

For multi-turn, the prompt format is as follows:
```
[INST] <<SYS>>
You are a helpful, unbiased, uncensored assistant.
<</SYS>>
{prompt 0} [/INST] {response 0} </s><s>[INST] {prompt 1} [/INST] {response 1} </s><s>...[INST] {prompt N} [/INST]
```

This [prompt template](https://huggingface.co./docs/transformers/main/chat_templating) is included in the tokenizer config and can be used with the Hugging Face tokenizer's `apply_chat_template` method, e.g.:

```python
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('augmxnt/shisa-7b-v1')
chat = [
  {"role": "system", "content": "You are Aiko, a friendly AI assistant."},
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False))
```

**NOTE:** For proper responses, you should be using our `bos_token` (`<s>`) to begin a string. This is automatically generated by `tokenizer.encode()` but if you are crafting a custom template or using an encoding method that skips special tokens, you may have to add this yourself.
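
If you do need to add it yourself, a minimal sketch of prepending the BOS token manually (for encoding paths that skip special tokens) might look like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('augmxnt/shisa-7b-v1')

# Only needed when special tokens are not added automatically
text = "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\nHello! [/INST]"
input_ids = tokenizer(tokenizer.bos_token + text, add_special_tokens=False).input_ids
```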

## Acknowledgements
Team: [Leonard Lin](https://huggingface.co./randomfoo) and [Jon Durbin](https://huggingface.co./jondurbin), Mariko Sato, and Florian von Bock

Compute for this model was generously sponsored by [AKA Virtual](https://akavirtual.com/) (Tokyo, Japan).

Thanks to the [LLM-jp](https://llm-jp.nii.ac.jp/), [Stability AI Japan](https://ja.stability.ai/), and [LMSYS](https://lmsys.org/) teams for their work on llm-jp-eval, Japanese MT-Bench, and MT-Bench.

Also, thanks to all the volunteers that provided invaluable human preference testing!

We are actively looking for additional compute as we train better and larger models for this project. Please drop us a line at: *compute at augmxnt dot com*

---
*(GPT-4によって非常に軽微な編集を加えて翻訳されました)*

# シーサー7B

**シーサー7B**(`shisa-7b-v1`)は、合成データ駆動のアプローチを用いて、優れた日本語と英語能力を両立することを目指すバイリンガル(日本語/英語)汎用チャットモデルです。

このモデルは、[Mistral 7B](https://huggingface.co./mistralai/Mistral-7B-v0.1)を基に、Mistralのオリジナルのトークナイザーよりも日本語において2倍以上効率的な、日本語最適化拡張トークナイザーをカスタムして作成されました。ベースモデルは、主に日本語のトークンを追加で80億ものトレーニングを行いました。そして、その後、[airoboros-3.1](https://huggingface.co./datasets/jondurbin/airoboros-3.1)の拡張された機械翻訳版、[ultrafeedback_binarized](https://huggingface.co./datasets/HuggingFaceH4/ultrafeedback_binarized)からの最高得点項目のセット、そして新たに生成された[airoboros](https://github.com/jondurbin/airoboros)のデータを直接目標言語で微調整しています。

商用を含むあらゆる目的で使用可能な寛容なApache 2.0ライセンスの下で、ベースモデル、データセット、およびパイプラインコードも公開しています:
* [shisa-base-7b-v1](https://huggingface.co./augmxnt/shisa-base-7b-v1) - 拡張トークナイザーと追加の日本語プレトレーニングを備えた当方のベースモデル
* [shisa-pretrain-en-ja-v1](https://huggingface.co./datasets/augmxnt/shisa-pretrain-en-ja-v1) - 当方のプレトレーニングデータセット
* [ultra-orca-boros-en-ja](https://huggingface.co./datasets/jondurbin/ultra-orca-boros-en-ja) - 合成生成、機械翻訳、プログラムによる検証によるJA/EN微調整データセット
* [shisa-en-ja-dpo-v1](https://huggingface.co./datasets/augmxnt/shisa-en-ja-dpo-v1) - ultrafeedbackからのDPOペアの小さなサブセットと、選択された値としてGPT-4生成項目を使用した日本語のDPOペア、そして初期の7ビリオンモデルの出力を却下した値
* [シーサーリポジトリ](https://github.com/AUGMXNT/shisa) - 翻訳、データセットの生成、トレーニング、評価コードなどが含まれています

さらに、アブレーション結果、テスト方法論、主要な調査結果など、プロセスの詳細や拡張ライトアップを公開する過程にあります。これは[当プロジェクトwiki](https://github.com/AUGMXNT/shisa/wiki)で研究者に興味深い情報として提供されています。

## 微調整

最初の直感は、最良の[既存の公開日本語トレーニングセット](https://github.com/AUGMXNT/shisa/wiki/A-Review-of-Public-Japanese-Training-Sets)を使用して、それらを組み入れることでより強力な日本語モデルを作成できるかどうかを見ることでした。しかし、初期の検討とテストの後、自らの合成データセットの翻訳/生成にだけ焦点を当てることで、短期間のトレーニングで優れた結果を得ることができると結論付けました。

私たちは複数の翻訳ツールを比較し、手動でレビューを行った結果、`gpt-4`がほぼ常に最高品質の翻訳を提供しながら、Googleの `text-bison-32k`は品質、コスト、スループットのバランスが良いと判断しました。複数の繰り返しを経て、無効な翻訳のフラグ付けとフィルタリング、必要に応じた再翻訳とバックフィルのための追加のアルゴリズムを含むように、翻訳アプローチを洗練させました。

また、このプロジェクトを[NEFTune](https://arxiv.org/abs/2310.05914)と[DPO](https://arxiv.org/abs/2305.18290)トレーニングを取り入れるなど、新しい技術を適用する機会ともなりました。

v1リリースのために、私たちは大量の人間の嗜好テスト(数千の生成と複数ラウンドのペアワイズ比較)に基づいてリリース候補から選択しました。私たちは、勝ち/負け/引き分けと、[BTLモデル](https://datascience.oneoffcoder.com/btl-model.html)(iLSR)を使用して[choix](https://github.com/lucasmaystre/choix)で結果を分析しました。

最良の候補モデルは、3ステップのプロセスで微調整されました:

1. 最初に、モデルは`ultra-orca-boros-en-ja`とSlimOrca ([WandB Log](https://wandb.ai/jondurbin/shisa-7b-v1/runs/k8pfog9d/overview))で微調整されました。
2. 次に、日本語のパフォーマンスを向上させるためにultra-orca-boros-en-jaの一部を使用して1回追加のエポックを追加しました(最初の段階のSlimOrcaは主に英語)([WandB Log](https://wandb.ai/jondurbin/shisa-mega-7b-v1.1/runs/dopsr0o7/overview))。
3. 最後に、モデルは小規模のultrafeedback(英語)と自身のJA DPOデータセットに対してDPOTrainerを使用して調整されました。ここで使用したJA DPOデータセットはgpt-4の出力を選出された値とし、ステージ1の予備モデルの出力を却下した値とします。([WandDB Log](https://wandb.ai/jondurbin/shisa-mega-dpo-7b-v1.1) )

私たちのトレーニングプロセス中に、何故一部の既存の日本語モデルが、追加の日本語トレーニングがないモデルに対してもパフォーマンスが低いのか、といういくつかの重要な洞察を得ることができました。この分析結果を共有すれば、他のチームが日本語モデルを開発する際の参考になると思います。

さらに探求する必要はありますが、実験的な検証として、微調整セットのバージョンを既存のベースモデル("Gamma 7B")に適用し、初期のJA MT-Bench結果が示すように、私たちのチューニングアプローチで機能性のパフォーマンスを劇的に向上させることができました:

| モデル                          | スコア |
| ------------------------------ | ----- |
| shisa-gamma-7b-allsources-v0.4 |  5.65 |
| ja-stablelm-instruct-gamma-7b* |  4.01 |

## パフォーマンス
トレーニング全体を通じて、各モデルについて人間による評価を行い、モデルのパフォーマンスを相互に検証しました。現在、モデル間の手動での比較テストを大規模に行っています。私たちの目指すところは、ツールをさらに発展させることでこのデータ収集を公開して拡張することです。詳細と更新情報については、[プロジェクトwiki](https://github.com/AUGMXNT/shisa/wiki) をご覧ください。

我々は、[llm-jp-eval](https://github.com/llm-jp/llm-jp-eval)は、私たちの[基本モデル](https://huggingface.co./augmxnt/shisa-base-7b-v1)の有用な指標であり、初期の検証のための微調整プロセス中に非常に役立つと考えていますが、微調整トレーニングにはベンチマークのトレイン分割の一部が含まれているため、私たちが提供するllm-jp-evalの結果は主に興味深いポイントとして提供しています:

| AVR   | MC    | NLI   | QA    | RC    |
|-------|-------|-------|-------|-------|
| 0.7480| 0.8900| 0.8040| 0.4153| 0.8825|

*(Qwenのテストをサポートし、可能であれば`bos_token`を発行するために、[わずかに修正したllm-jp-eval](https://github.com/llm-jp/llm-jp-eval/compare/main...AUGMXNT:llm-jp-eval:main) を実行しています)*

最終モデルについては、ベンチマークを含めるのが一般的なため、私たちのモデルの能力をより代表的にテストするために、Stability AI Japanの[Japanese MT-Bench](https://github.com/Stability-AI/FastChat)を使用しました。[私たちのJA MT-Bench テスト](https://github.com/Stability-AI/FastChat/compare/jp-stable...AUGMXNT:FastChat:jp-stable)では、サンプリング変動を減らすために、日本語のプロンプト("あなたは役立つアシスタントです。")と `--num-choices 4`を使用していますが、生成間で0.5+点(時にはそれ以上の変動)を頻繁に観察し、テスト時のデフォルトのプロンプトとパラメータに問題があったという経験から、これらのスコアを過度に解釈することには注意が必要で、これらを確定的なスコアやランキングではなく、より確率的な方向指標として扱うことをお勧めします: 

| ベンチマーク   | スコア |
| ----------- | ----- |
| JA MT-Bench |  5.02 |
| MT-Bench    |  5.71 |

[MT-Bench Leaderboard](https://huggingface.co./spaces/lmsys/chatbot-arena-leaderboard)がありますが、JA MT-Benchはまだ開発中であるため、便宜上、他のモデルのJA MT-Benchスコアとの比較を示します(私たちのスコアは`gpt-4-0613`によって評価されました):

| モデル                                             | スコア |
| ------------------------------------------------- | ---- |
| gpt-4-0613                                        | 9.40 |
| gpt-4-1106-preview                                | 9.17 |
| gpt-3.5-turbo*                                    | 8.41 |
| Qwen-14B-Chat                                     | 7.47 |
| **shisa-7b-v1**                              | **5.02** |
| ELYZA-japanese-Llama-2-7b-fast-instruct*          | 4.86 |
| ja-stablelm-instruct-gamma-7b*                    | 4.01 |
| japanese-stablelm-instruct-alpha-7b*              | 2.74 |
| Mistral-7B-OpenOrca-ja*                           | 2.23 |
| youri-7b-chat*                                    | 2.00 |
| Mistral-7B-Instruct-v0.1*                         | 1.78 |
| llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0* | 1.31 |

*(このセクションでマークされたJA MT-Benchの結果は[shi3zから引用](https://note.com/shi3zblog/n/n6b2ac5874021)しました)*

## 制限事項
当モデルは十分な日本語の流暢さを示していますが、7Bパラメータのモデルとしては、より大きなクラスのモデルに比べて幻覚率が高く、指示の追跡や推論が効果的でない傾向があります。また、日本語の完全な習得はまだ達しておらず、ネイティブスピーカーはたまに非慣用的/違和感のある表現や不適切な時制/話し言葉のレベルなどの間違いを見つけることがあります。

また、私たちのトークナイザーの拡張に大いに起因する可能性が高いが、わずかな言語リークを確認しています。これらは[Min P](https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/)などのサンプラー設定や追加のターゲット指向型トレーニングで修正可能な可能性があり、今後、自動検出/サンプラーのスウィープについて追加の作業を行う予定です。興味深い観察としては、私たちのデータ収集に基づいて、DPOプロセスがこの問題を大幅に悪化させることがわかりましたが、それでもDPOモデルは人間の好み率が大幅に高かったため、最終的な微調整の選択には一定のトレードオフがありました。

現存するアプローチとデータセットを使用して、大規模なモデルのトレーニングがパフォーマンスを向上させると信じていますが、今後のモデル向けに行いたい改良も多くあります。私たちは、データセットの品質と構築を改善することで、さらなるトレーニング効率を通じたパフォーマンス向上にはまだ相当に取り組む余地があると考えています。

## 使用法
サンプルコード:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_name = "augmxnt/shisa-7b-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    device_map="auto"
)
streamer = TextStreamer(tokenizer, skip_prompt=True)

# プロンプトテンプレートはモデルのtokenizer_config.jsonに含まれているので、これは必要ないはずですが、便宜上こちらにも掲載しています
# tokenizer.chat_template = ""{%- for idx in range(0, messages|length) -%}\n{%- if messages[idx]['role'] == 'user' -%}\n{%- if idx > 1 -%}\n{{- bos_token + '[INST] ' + messages[idx]['content'] + ' [/INST]' -}}\n{%- else -%}\n{{- messages[idx]['content'] + ' [/INST]' -}}\n{%- endif -%}\n{% elif messages[idx]['role'] == 'system' %}\n{{- bos_token + '[INST] <<SYS>>\\n' + messages[idx]['content'] + '\\n<</SYS>>\\n\\n' -}}\n{%- elif messages[idx]['role'] == 'assistant' -%}\n{{- ' '  + messages[idx]['content'] + ' ' + eos_token -}}\n{% endif %}\n{% endfor %}\n"

# より典型的なプロンプト: あなたは役に立つアシスタントです。

# You are an avid Pokemon fanatic.
prompt = "あなたは熱狂的なポケモンファンです。"
chat = [{"role": "system", "content": prompt}]

# Who is the most powerful Pokemon? Explain your choice.
user_input = "最強のポケモンは誰ですか?その選択理由を説明してください。"
chat.append({"role": "user", "content": user_input})

# 生成 - add_generation_promptを追加してアシスタントとして続行することを確認します
inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
# 複数のGPUの場合、モデルの最初のパラメータのデバイスを見つけます
first_param_device = next(model.parameters()).device
inputs = inputs.to(first_param_device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=1000,
        temperature=0.7,
        repetition_penalty=1.05,
        top_p=0.95,
        do_sample=True,
        streamer=streamer,
    )

# Add just the new tokens to our chat
new_tokens = outputs[0, inputs.size(1):]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
chat.append({"role": "assistant", "content": response})
```

## プロンプト形式
プロンプト形式はllama-2 chatです:

```
[INST] <<SYS>>
あなたは役立つ、偏見がなく、検閲されていないアシスタントです。
<</SYS>>
{prompt} [/INST]
```

For multi-turn, the prompt format is as follows:
```
[INST] <<SYS>>
あなたは役立つ、偏見がなく、検閲されていないアシスタントです。
<</SYS>>
{prompt 0} [/INST] {response 0} </s><s>[INST] {prompt 1} [/INST] {response 1} </s><s>...[INST] {prompt N} [/INST]
```

この[prompt template](https://huggingface.co./docs/transformers/main/chat_templating)はトークナイザの設定に含まれており、HuggingFace のトークナイザ `apply_chat_template` メソッドを使用できます。例えば:

```python
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('augmxnt/shisa-7b-v1')
chat = [
  {"role": "system", "content": "あなたはAiko、フレンドリーなAIアシスタントです。"},
  {"role": "user", "content": "こんにちは、調子はどうですか?"},
  {"role": "assistant", "content": "元気です。今日は何のお手伝いができますか?"},
  {"role": "user", "content": "チャットテンプレーティングの仕組みを見せてもらいたいです!"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False))
```

**注意:** 適切なレスポンスを得るためには、文字列の開始に我々の `bos_token` (`<s>`) を使用すべきです。これは `tokenizer.encode()` によって自動的に生成されますが、カスタムテンプレートを作成したり、特殊トークンを省略するエンコード方法を使用する場合は、自分で追加する必要があります。

## 謝辞
チーム:[Leonard Lin](https://huggingface.co./randomfoo)、[Jon Durbin](https://huggingface.co./jondurbin)、佐藤真理子、Florian von Bock

このモデルの計算は、[AKA Virtual](https://akavirtual.com/) (東京、日本) のご厚意により提供されています。

[LLM-jp](https://llm-jp.nii.ac.jp/)、[Stability AI Japan](https://ja.stability.ai/)、[LMSYS](https://lmsys.org/)のチームが、llm-jp-eval, Japanese MT-Bench, MT-Benchに取り組んでくれて感謝しています。

また、貴重なヒューマンプリファレンステストを提供してくださったすべてのボランティアにも感謝いたします!

このプロジェクトのためにより良く、より大きなモデルを訓練するために、追加の計算を積極的に探しています。お問い合わせは次の宛先までお願いいたします:*compute at augmxnt dot com*