---
base_model: databricks/dbrx-instruct
---
# dbrx_moe_fp8_test
## Introduction
This model was created by applying [Quark](https://quark.docs.amd.com/latest/index.html) to [databricks/dbrx-instruct](https://huggingface.co/databricks/dbrx-instruct), using calibration samples from the Pile dataset.
## Quantization Strategy
- ***Quantized Layers***: All linear layers excluding "lm_head" and "router.layer"
- ***Weight***: FP8 symmetric per-tensor
- ***Activation***: FP8 symmetric per-tensor
- ***KV Cache***: FP8 symmetric per-tensor
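For intuition, the sketch below shows what symmetric per-tensor quantization computes: a single scale per tensor, chosen so that the largest magnitude maps to the edge of the FP8 range. It is an illustrative approximation, not Quark's implementation; it assumes the OCP FP8-E4M3 format (maximum magnitude 448) and PyTorch's `torch.float8_e4m3fn` dtype.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in OCP FP8-E4M3

def fp8_per_tensor_symmetric(x: torch.Tensor):
    """Quantize a tensor with one symmetric per-tensor FP8 scale."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

w = torch.randn(4096, 4096)
w_fp8, w_scale = fp8_per_tensor_symmetric(w)
w_hat = w_fp8.to(torch.float32) * w_scale  # dequantize for comparison
print(f"max abs error: {(w - w_hat).abs().max().item():.5f}")
```

For activations and the KV cache, the per-tensor scale is not taken from the tensor being quantized at runtime but is calibrated offline, here from the 128 Pile samples.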
## Quick Start
1. [Download and install Quark](https://quark.docs.amd.com/latest/install.html)
2. Run the quantization script in the example folder using the following command line:
```sh
export MODEL_DIR="<local model checkpoint folder or databricks/dbrx-instruct>"
# single GPU
python3 quantize_quark.py \
--model_dir $MODEL_DIR \
--output_dir dbrx_moe_fp8_test \
--quant_scheme w_fp8_a_fp8 \
--kv_cache_dtype fp8 \
--num_calib_data 128 \
--model_export quark_safetensors \
--no_weight_matrix_merge

# If the model is too large for a single GPU, use multiple GPUs instead.
python3 quantize_quark.py \
--model_dir $MODEL_DIR \
--output_dir dbrx_moe_fp8_test \
--quant_scheme w_fp8_a_fp8 \
--kv_cache_dtype fp8 \
--num_calib_data 128 \
--multi_gpu \
--model_export quark_safetensors \
--no_weight_matrix_merge
```
## Deployment
Quark has its own export format, which allows FP8 quantized models to be efficiently deployed through the vLLM backend (vLLM-compatible).
In the dbrx-instruct model, the "transformer.blocks.\*.ffn.experts" modules can be divided into experts-num MLPs. If the w1 weight of one of these MLPs has shape [dim1, dim2], then "transformer.blocks.\*.ffn.experts.mlp.w1.weight" in the exported safetensors file has shape [dim1\*experts-num, dim2]. The shapes of "transformer.blocks.\*.ffn.experts.mlp.w1.weight_scale" and "transformer.blocks.\*.ffn.experts.mlp.w1.input_scale" are [dim1]. The same applies to the w2 and v1 weights of "transformer.blocks.\*.ffn.experts.mlp".
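To make that layout concrete, here is a hedged sketch that opens an exported shard with the `safetensors` library and splits a stacked w1 weight back into per-expert matrices. The file name, block index, and expert count (dbrx-instruct uses 16 experts per MoE layer) are assumptions; adapt them to the actual output of quantize_quark.py, which may be sharded across several files.

```python
from safetensors import safe_open

NUM_EXPERTS = 16  # dbrx-instruct has 16 experts per MoE layer; adjust if needed

# File name is hypothetical; a real export may be split into multiple shards.
with safe_open("dbrx_moe_fp8_test/model.safetensors", framework="pt", device="cpu") as f:
    name = "transformer.blocks.0.ffn.experts.mlp.w1.weight"
    stacked = f.get_tensor(name)  # shape: [dim1 * NUM_EXPERTS, dim2]

# Recover the per-expert matrices, assuming experts are concatenated
# in order along the first dimension.
dim1 = stacked.shape[0] // NUM_EXPERTS
per_expert = stacked.reshape(NUM_EXPERTS, dim1, stacked.shape[1])
print(per_expert.shape)  # (NUM_EXPERTS, dim1, dim2); per_expert[i] is expert i's w1
```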
## Evaluation
Quark currently uses perplexity (PPL) as the metric for measuring accuracy loss before and after quantization. The specific PPL algorithm can be found in quantize_quark.py.
The quantization evaluation results are obtained in pseudo-quantization mode, which may differ slightly from the accuracy of actual quantized inference. These results are provided for reference only.
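For orientation, the following is a minimal sketch of the standard non-overlapping-window perplexity computation on wikitext2 that such evaluations typically use, assuming the transformers and datasets libraries; the actual routine used here is the one in quantize_quark.py.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "databricks/dbrx-instruct"  # or a local quantized checkpoint folder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

# Tokenize the whole test split as one stream, then score fixed-size windows.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

seq_len, nlls = 2048, []
for i in range(0, ids.shape[1] - seq_len, seq_len):
    window = ids[:, i : i + seq_len].to(model.device)
    with torch.no_grad():
        # labels=window makes the model return the mean next-token NLL
        nlls.append(model(window, labels=window).loss)
print(f"wikitext2 PPL: {torch.exp(torch.stack(nlls).mean()).item():.4f}")
```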
#### Evaluation scores

| Benchmark | dbrx-instruct | dbrx_moe_fp8_test (this model) |
|---|---|---|
| Perplexity-wikitext2 | 4.2275 | 4.3033 |
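Quantization raises wikitext2 perplexity by 0.0758, a relative increase of about 1.8% over the original model.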
#### License
Modifications copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.