Raincleared committed
Commit: cd79cdd • Parent: 907f92d
Upload README.md with huggingface_hub

README.md CHANGED
@@ -3,7 +3,7 @@ language:
 - en
 library_name: transformers
 license: llama2
-
+
 ---


@@ -48,7 +48,7 @@ Intuitively, training the model with even more tokens or with data of a wider co

 ### ProSparse: Training Methodology

-The training process of ProSparse consists of three steps (refer to Section 3.2 of [paper](
+The training process of ProSparse consists of three steps (refer to Section 3.2 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details):

 1. **Activation Function Substitution**: We substitute the activation function of FFNs with ReLU and apply continual training;
 2. **Progressive Sparsity Regularization**: We jointly optimize the model on the conventional next-token prediction loss and an \\(L_1\\) regularization loss. The regularization is applied to the sparse intermediate outputs of FFNs, with a regularization factor that increases progressively in multiple stages. Specifically, the regularization factor \\(\lambda\\) is set to a small constant for the warmup stage, and then increases along a smooth sine curve for each of the subsequent incremental stages. Each stage is accompanied by a certain number of training steps. In this way, the model has more time to adapt to the increasing regularization without radical activation shifts, thus alleviating performance degradation.
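As a rough illustration of the sine-curve schedule described in step 2 of the hunk above, here is a minimal sketch of how the regularization factor \\(\lambda\\) could be ramped up stage by stage. The function name, stage boundaries, and peak values are hypothetical and not taken from the paper.

```python
import math

def prosparse_lambda(step, warmup_steps, stage_ends, stage_peaks, warmup_lambda=1e-8):
    """Sketch of a progressive sparsity-regularization schedule (illustrative values only).

    - During warmup, lambda stays at a small constant.
    - In each later stage, lambda rises along a smooth sine-shaped ramp
      from the previous stage's peak to this stage's peak.
    """
    if step < warmup_steps:
        return warmup_lambda
    prev_end, prev_peak = warmup_steps, warmup_lambda
    for end, peak in zip(stage_ends, stage_peaks):
        if step < end:
            t = (step - prev_end) / (end - prev_end)       # progress in [0, 1] within the stage
            ramp = 0.5 * (1.0 - math.cos(math.pi * t))     # smooth 0 -> 1 sine-shaped ramp
            return prev_peak + (peak - prev_peak) * ramp
        prev_end, prev_peak = end, peak
    return stage_peaks[-1]

# Example: warmup for 1,000 steps, then two incremental stages (hypothetical numbers).
for s in (0, 1_000, 3_000, 5_000, 8_000):
    print(s, prosparse_lambda(s, warmup_steps=1_000,
                              stage_ends=[5_000, 10_000],
                              stage_peaks=[1e-4, 1e-2]))
```

The resulting \\(\lambda\\) would scale an \\(L_1\\) penalty on the FFN intermediate outputs, added to the next-token prediction loss at each step.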
@@ -77,7 +77,7 @@ The 13B model is trained on 32 A100 GPUs. The learning rate (LR) is controlled b

 ### Evaluation Results

-The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method achieving high sparsity and comparable performance to the original Swish-activated LLaMA2. Note that models under all settings are trained with the same number of tokens on the same mixed dataset. Refer to Section 4.2 of [paper](
+The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method that achieves both high sparsity and performance comparable to the original Swish-activated LLaMA2. Note that the models under all settings are trained with the same number of tokens on the same mixed dataset. Refer to Section 4.2 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details.

 | Setting | Average<br>Sparsity | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval | Average |
 | :-------------------: | :-----------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: |
@@ -106,7 +106,7 @@ Moreover, considering the potential inference inaccuracies caused by wrong predi

 where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) denote the gating scores, the FFN input hidden states, the intermediate outputs, and the element-wise multiplication, respectively. \\(\mathbf{W}_1\\) and \\(\mathbf{W}_2\\) are FFN weight matrices.

-The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](
+The acceleration effects of LLMs with different sparsity levels are shown below. ProSparse, which reaches high sparsity without performance degradation, gains the largest benefits among all the settings considered. Refer to Section 4.3 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details.

 | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | `S2`<br>Time | `S2`<br>Speedup | `S3`<br>Time | `S3`<br>Speedup |
 | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
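To connect the notation in the hunk above with code, the following is a minimal PyTorch sketch of a ReLU-gated LLaMA-style FFN. The gate/up/down projection layout and all names here are assumptions, since the model card's actual formula lies outside this hunk.

```python
import torch
import torch.nn as nn

class ReLUGatedFFN(nn.Module):
    """Sketch matching the notation above (the W1/W2/down layout is an assumption):
        s  = ReLU(x @ W1)   -- gating scores, sparse after ReLU
        x1 = s * (x @ W2)   -- intermediate outputs (element-wise product)
    A down projection back to the hidden size is assumed but not named above.
    """
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)    # gate projection
        self.w2 = nn.Linear(hidden_size, intermediate_size, bias=False)    # up projection
        self.down = nn.Linear(intermediate_size, hidden_size, bias=False)  # assumed output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = torch.relu(self.w1(x))   # zero entries correspond to inactive neurons
        x1 = s * self.w2(x)          # sparse intermediate outputs
        return self.down(x1)

# Example: estimate the fraction of zero gating scores on random input.
ffn = ReLUGatedFFN(hidden_size=64, intermediate_size=172)
x = torch.randn(2, 8, 64)
with torch.no_grad():
    sparsity = (torch.relu(ffn.w1(x)) == 0).float().mean().item()
print(f"average sparsity of s: {sparsity:.2%}")
```

The fraction of zero entries in \\(\mathbf{s}\\) measured in this sketch is roughly the quantity that the "Average Sparsity" columns above report.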
@@ -140,9 +140,11 @@ Please kindly cite using the following BibTeX:
 title={{ProSparse}: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models},
 author={Song, Chenyang and Han, Xu and Zhang, Zhengyan and Hu, Shengding and Shi, Xiyu and Li, Kuai and Chen, Chen and Liu, Zhiyuan and Li, Guangli and Yang, Tao and Sun, Maosong},
 year={2024},
+journal={arXiv preprint arXiv:2402.13516},
+url={https://arxiv.org/pdf/2402.13516.pdf}
 }
 ```

 #### Acknowledgments

 The model card is modified from [ReluLLaMA-13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B).