mobicham committed on
Commit 5838ad7 · verified · 1 Parent(s): d475e71

Update README.md

Files changed (1)
  1. README.md +2 -4
README.md CHANGED
@@ -8,11 +8,9 @@ pipeline_tag: text-generation
  ---
  ## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ
  This is a version of the
- <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1"> Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ).
+ <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1"> Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ). More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
  
- More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
-
- The difference between this model and <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ"> this </a> is that this one offloads the metadata to the CPU and you only need 13GB Vram to run it instead of 20GB!
+ This model was designed to get the best quality at a budget of ~13GB of VRAM. It reaches an impressive <b>70.01</b> LLM leaderboard score, not too far from the original model's <b>72.62</b>.
  
  ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)
  
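
To put the numbers quoted in this diff in proportion, here is a minimal arithmetic sketch. It uses only the figures stated in the README text (13GB vs 20GB of VRAM, and leaderboard scores of 70.01 vs 72.62). The constant and function names are hypothetical, chosen for illustration, and nothing here is measured or taken from the HQQ library.

```python
# Figures quoted in the README diff; nothing here is measured independently.
VRAM_WITH_OFFLOAD_GB = 13.0      # this model (metadata offloaded to the CPU)
VRAM_WITHOUT_OFFLOAD_GB = 20.0   # earlier variant, per the removed README line
SCORE_QUANTIZED = 70.01          # LLM leaderboard score of this model
SCORE_ORIGINAL = 72.62           # LLM leaderboard score of the original model


def relative_change_pct(new: float, old: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# Negative value means less VRAM is needed than for the earlier variant.
vram_change_pct = relative_change_pct(VRAM_WITH_OFFLOAD_GB, VRAM_WITHOUT_OFFLOAD_GB)

# Absolute gap in leaderboard points and the share of the original score retained.
score_gap = SCORE_ORIGINAL - SCORE_QUANTIZED
score_retained_pct = SCORE_QUANTIZED / SCORE_ORIGINAL * 100.0

print(f"VRAM change vs. non-offloaded variant: {vram_change_pct:.1f}%")
print(f"Leaderboard gap vs. original model: {score_gap:.2f} points")
print(f"Share of original score retained: {score_retained_pct:.1f}%")
```

The VRAM comparison is between the two quantized variants, while the score comparison is between this model and the unquantized original, so the two percentages describe different baselines.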