nm-research committed
Commit c6bc9c9 · verified · 1 Parent(s): f3bdfac

Create README.md

Files changed (1): README.md +78 -0

README.md ADDED
@@ -0,0 +1,78 @@
---
tags:
- vllm
- sparsity
- quantized
pipeline_tag: text-generation
license: llama3.1
base_model: neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4
datasets:
- openai/gsm8k
language:
- en
metrics:
- accuracy
---

# Sparse-Llama-3.1-8B-gsm8k-2of4-FP8-dynamic

## Model Overview
- **Model Architecture:** Llama-3.1-8B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Sparsity:** 2:4
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 11/21/2024
- **Version:** 1.0
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

This is an AI model specialized in grade-school math, obtained by fine-tuning the 2:4 sparse [Sparse-Llama-3.1-8B-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-2of4) on the [GSM8k](https://huggingface.co/datasets/openai/gsm8k) dataset, followed by one-shot quantization.
It achieves 66.8% 0-shot accuracy on the test set of GSM8k, compared to 66.3% for the fine-tuned dense model [Llama-3.1-8B-gsm8k](https://huggingface.co/neuralmagic/Llama-3.1-8B-gsm8k), demonstrating over **100% accuracy recovery**.
In contrast, the pretrained [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) achieves 50.7% 5-shot accuracy and the sparse foundational [Sparse-Llama-3.1-8B-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-2of4) model achieves 56.3% 5-shot accuracy.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Sparse-Llama-3.1-8B-gsm8k-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.
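As a rough sanity check of the memory claim, 8-bit weights need about half the bytes of 16-bit weights; the parameter count below is approximate and ignores embeddings, activations, and KV-cache overhead.

```python
params = 8.0e9                  # approximate Llama-3.1-8B parameter count
bf16_weight_bytes = 2 * params  # 16-bit weights ≈ 16 GB
fp8_weight_bytes = 1 * params   # 8-bit weights  ≈ 8 GB
print(fp8_weight_bytes / bf16_weight_bytes)  # 0.5 → roughly 50% less weight memory
```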
Only weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between the FP8 and BF16 representations for each output channel dimension.
Linear scaling factors are computed by minimizing the mean squared error (MSE).
Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between the FP8 and BF16 representations.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
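To make the quantization scheme concrete, here is a minimal, illustrative sketch of symmetric FP8 (E4M3) quantization with static per-channel weight scales and dynamic per-token activation scales. It uses simple absmax scales for brevity, whereas this checkpoint uses MSE-optimal scales, so treat the exact scale computation as an assumption rather than the library's implementation.

```python
import torch

FP8_MAX = 448.0  # largest magnitude representable in torch.float8_e4m3fn


def quantize_weights_per_channel(w: torch.Tensor):
    """Symmetric static per-channel scheme: one fixed scale per output channel."""
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_MAX              # [out_features, 1]
    w_fp8 = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale


def quantize_activations_per_token(x: torch.Tensor):
    """Symmetric dynamic per-token scheme: scales are computed at runtime."""
    scale = x.abs().amax(dim=-1, keepdim=True) / FP8_MAX             # [..., tokens, 1]
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale


# Dequantization recovers a BF16 approximation of the original tensor.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, w_scale = quantize_weights_per_channel(w)
w_hat = w_fp8.to(torch.bfloat16) * w_scale
```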
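For reference, the snippet below sketches how a comparable one-shot FP8 quantization can be expressed with llm-compressor's `oneshot` API and its `FP8_DYNAMIC` scheme (FP8 weights with static per-channel scales, FP8 activations with dynamic per-token scales). This is not the exact recipe used to produce this checkpoint, so treat the modifier choice and arguments as assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4"  # sparse, GSM8k fine-tuned base

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize the Linear layers inside the transformer blocks; keep lm_head in 16 bit.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

save_dir = "Sparse-Llama-3.1-8B-gsm8k-2of4-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```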
## Deployment with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM also supports OpenAI-compatible serving; see the [documentation](https://docs.vllm.ai/en/latest/) for more details.
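For example, the snippet below is a minimal sketch of offline generation with vLLM's Python API. The repository id `neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4-FP8-dynamic` and the example prompt are assumptions used only for illustration.

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face repository id for this checkpoint.
MODEL_ID = "neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4-FP8-dynamic"

llm = LLM(model=MODEL_ID)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

prompt = (
    "Question: A classroom has 4 rows of desks with 6 desks in each row. "
    "How many desks are there in total?\nAnswer:"
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```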
## Evaluation

This model was evaluated on the GSM8k benchmark with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
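The exact harness configuration is not included in this card; the following is a hedged sketch of how a comparable 0-shot GSM8k run can be launched through the harness's Python API with vLLM as the backend. The repository id is again an assumption.

```python
import lm_eval

# Assumed Hugging Face repository id for this checkpoint.
MODEL_ID = "neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4-FP8-dynamic"

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=f"pretrained={MODEL_ID},dtype=auto",
    tasks=["gsm8k"],
    num_fewshot=0,
)
print(results["results"]["gsm8k"])
```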
### Accuracy
#### GSM8k Benchmark
<table>
  <tr>
    <td><strong>Metric</strong></td>
    <td style="text-align: center"><strong>Llama-3.1-8B<br>(5-shot)</strong></td>
    <td style="text-align: center"><strong>Sparse-Llama-3.1-8B-2of4<br>(5-shot)</strong></td>
    <td style="text-align: center"><strong>Llama-3.1-8B-gsm8k<br>(0-shot)</strong></td>
    <td style="text-align: center"><strong>Sparse-Llama-3.1-8B-gsm8k-2of4<br>(0-shot)</strong></td>
    <td style="text-align: center"><strong>Sparse-Llama-3.1-8B-gsm8k-2of4-FP8-dynamic<br>(0-shot)</strong></td>
  </tr>
  <tr>
    <td>Accuracy</td>
    <td style="text-align: center">50.7%</td>
    <td style="text-align: center">56.3%</td>
    <td style="text-align: center">66.3%</td>
    <td style="text-align: center">66.9%</td>
    <td style="text-align: center">66.8%</td>
  </tr>
</table>