mobicham committed on
Commit 78ace90 · verified · 1 Parent(s): cbbc082

Update README.md

Files changed (1)
  1. README.md +18 -13
README.md CHANGED
@@ -7,14 +7,15 @@ inference: false
7
  pipeline_tag: text-generation
8
  ---
9
  ## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ
10
- This is a version of the Mixtral-8x7B-Instruct-v0.1 model (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ).
11
 
12
- More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
 
 
13
 
14
 
15
  ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)
16
 
17
- The difference between this model and https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ is that this one offloads the metadata to the CPU and you only need 13GB Vram to run it instead of 20GB!
18
 
19
  ----------------------------------------------------------------------------------------------------------------------------------
20
  </p>
@@ -23,14 +24,14 @@ The difference between this model and https://huggingface.co/mobiuslabsgmbh/Mixt
23
  ## Performance
24
  | Models | Mixtral Original | HQQ quantized |
25
  |-------------------|------------------|------------------|
26
- | Runtime VRAM | 90 GB | <b>13 GB</b> |
27
- | ARC (25-shot) | 70.22 | 66.47 |
28
- | Hellaswag (10-shot)| 87.63 | 84.78 |
29
- | MMLU (5-shot) | 71.16 | 67.35 |
30
- | TruthfulQA-MC2 | 64.58 | 62.85 |
31
- | Winogrande (5-shot)| 81.37 | 79.40 |
32
- | GSM8K (5-shot)| 60.73 | 45.86 |
33
- | Average| 72.62 | 67.79 |
34
 
35
 
36
  ## Screencast
@@ -104,8 +105,12 @@ model = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth
104
  from hqq.core.quantize import *
105
  attn_prams = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
106
  experts_params = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)
107
- attn_prams['scale_quant_params']['group_size'] = 256
108
- attn_prams['zero_quant_params']['group_size'] = 256
 
 
 
 
109
 
110
  quant_config = {}
111
  #Attention
 
7
  pipeline_tag: text-generation
8
  ---
9
  ## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ
10
+ This is a version of the <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1"> Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ). More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
11
 
12
+ The difference between this model and <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ"> our previous release </a> is that this one offloads the metadata to the CPU and you only need 13GB Vram to run it instead of 20GB!
13
+
14
+ *Note*: this model was updated to use a group-size of 128 instead of 256 for the scale/zero parameters, which slightly improves the overall score with a negligible increase in VRAM.
15
 
16
 
17
  ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)
18
 
 
19
 
20
  ----------------------------------------------------------------------------------------------------------------------------------
21
  </p>
 
24
  ## Performance
25
  | Models | Mixtral Original | HQQ quantized |
26
  |-------------------|------------------|------------------|
27
+ | Runtime VRAM | 94 GB | <b>13.5 GB</b> |
28
+ | ARC (25-shot) | 70.22 | 66.55 |
29
+ | Hellaswag (10-shot)| 87.63 | 84.83 |
30
+ | MMLU (5-shot) | 71.16 | 67.39 |
31
+ | TruthfulQA-MC2 | 64.58 | 62.80 |
32
+ | Winogrande (5-shot)| 81.37 | 80.03 |
33
+ | GSM8K (5-shot)| 60.73 | 45.41 |
34
+ | Average| 72.62 | 67.83 |
35
 
36
 
37
  ## Screencast
 
105
  from hqq.core.quantize import *
106
  attn_prams = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
107
  experts_params = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)
108
+ zero_scale_group_size = 128
109
+
110
+ attn_prams['scale_quant_params']['group_size'] = zero_scale_group_size
111
+ attn_prams['zero_quant_params']['group_size'] = zero_scale_group_size
112
+ experts_params['scale_quant_params']['group_size'] = zero_scale_group_size
113
+ experts_params['zero_quant_params']['group_size'] = zero_scale_group_size
114
 
115
  quant_config = {}
116
  #Attention
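
The note in this commit says that lowering the scale/zero group size from 256 to 128 costs only a negligible amount of VRAM. The sketch below is a rough way to see why, not a description of HQQ's actual storage layout. It assumes the usual group-wise arrangement in which every group of `group_size` weights carries one scale and one zero-point, and that those entries are in turn grouped by the scale/zero group size. The function `meta_counts` and the 4096 x 14336 tensor shape are hypothetical and chosen only for illustration.

```python
from math import ceil


def meta_counts(num_weights: int, group_size: int, meta_group_size: int) -> tuple[int, int]:
    """Return (scale/zero entries for the tensor, groups formed over those entries)."""
    # Assumed layout: one scale and one zero-point per group of `group_size` weights.
    entries = ceil(num_weights / group_size)
    # Assumed layout: the scale/zero entries are themselves grouped by `meta_group_size`.
    meta_groups = ceil(entries / meta_group_size)
    return entries, meta_groups


# Illustrative tensor shape only (hypothetical, not read from the model).
NUM_WEIGHTS = 4096 * 14336

for meta_group_size in (256, 128):
    entries, meta_groups = meta_counts(
        NUM_WEIGHTS, group_size=16, meta_group_size=meta_group_size
    )
    print(f"scale/zero group size {meta_group_size}: "
          f"{entries} entries, {meta_groups} second-level groups")
```

Under these assumptions, halving the scale/zero group size doubles only the number of second-level groups, which is already a small count compared with the number of weights. The number of first-level scale/zero entries is unchanged because it depends only on the weight `group_size`.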