Raincleared committed
Commit cd79cdd (1 parent: 907f92d)

Upload README.md with huggingface_hub

Files changed (1): README.md (+7 -5)
README.md CHANGED
@@ -3,7 +3,7 @@ language:
 - en
 library_name: transformers
 license: llama2
- pipeline_tag: text-generation
+
 ---


@@ -48,7 +48,7 @@ Intuitively, training the model with even more tokens or with data of a wider co

 ### ProSparse: Training Methodology

- The training process of ProSparse consists of three steps (refer to Section 3.2 of [paper](TODO) for more details):
+ The training process of ProSparse consists of three steps (refer to Section 3.2 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details):

 1. **Activation Function Substitution**: We substitute the activation function of FFNs with ReLU and apply continual training;
 2. **Progressive Sparsity Regularization**: We jointly optimize the model on the conventional next-token prediction loss and the \\(L_1\\) regularization loss. The regularization is applied to the sparse intermediate outputs of FFNs with a regularization factor that increases progressively in multiple stages. Specifically, the regularization factor \\(\lambda\\) is set to a small constant for the warmup stage, and then increases along a smooth sine curve in each of the subsequent incremental stages. Each stage lasts for a certain number of training steps. In this way, the model has more time to adapt to the increasing regularization without radical activation shifts, which alleviates performance degradation (a toy schedule is sketched below).
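
As an aside, the warmup-then-sine schedule for \\(\lambda\\) described in step 2 can be pictured with a short Python sketch. The stage lengths, target values, and the helper name `regularization_factor` below are illustrative placeholders, not the paper's actual hyperparameters or training code.

```python
import math

def regularization_factor(step, warmup_steps, stage_steps, stage_targets, warmup_lambda=1e-8):
    """Toy schedule: constant lambda during warmup, then a smooth sine ramp
    from the previous factor to each incremental stage's target value.
    All numbers are placeholders, not the paper's settings."""
    if step < warmup_steps:
        return warmup_lambda                      # warmup stage: small constant
    s, prev = step - warmup_steps, warmup_lambda
    for length, target in zip(stage_steps, stage_targets):
        if s < length:
            # rise along a sine curve, so lambda never jumps discontinuously
            return prev + (target - prev) * math.sin(0.5 * math.pi * s / length)
        s, prev = s - length, target
    return stage_targets[-1]                      # hold the final factor afterwards

# Example: 1k warmup steps, then two incremental stages of 2k steps each.
print(regularization_factor(2_000, 1_000, [2_000, 2_000], [1e-6, 1e-5]))
```

Because each stage starts from the factor the previous stage ended at, the model sees a smoothly and monotonically increasing \\(\lambda\\), which is the property the paragraph above relies on to avoid radical activation shifts.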
@@ -77,7 +77,7 @@ The 13B model is trained on 32 A100 GPUs. The learning rate (LR) is controlled b

 ### Evaluation Results

- The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method achieving high sparsity and comparable performance to the original Swish-activated LLaMA2. Note that models under all settings are trained with the same number of tokens on the same mixed dataset. Refer to Section 4.2 of [paper](TODO) for more details.
+ The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method achieving high sparsity and comparable performance to the original Swish-activated LLaMA2. Note that models under all settings are trained with the same number of tokens on the same mixed dataset. Refer to Section 4.2 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details.

 | Setting | Average<br>Sparsity | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval | Average |
 | :-------------------: | :-----------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: |
@@ -106,7 +106,7 @@ Moreover, considering the potential inference inaccuracies caused by wrong predi

 where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) denote the gating scores, the FFN input hidden states, the intermediate outputs, and the element-wise multiplication, respectively. \\(\mathbf{W}_1\\) and \\(\mathbf{W}_2\\) are FFN weight matrices.

- The acceleration effects of LLMs with different levels of sparsity are displayed as follows. ProSparse, which reaches high sparsity without performance degradation, gains the most benefit among all the settings considered. Refer to Section 4.3 of [paper](TODO) for more details.
+ The acceleration effects of LLMs with different levels of sparsity are displayed as follows. ProSparse, which reaches high sparsity without performance degradation, gains the most benefit among all the settings considered. Refer to Section 4.3 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details.

 | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | `S2`<br>Time | `S2`<br>Speedup | `S3`<br/>Time | `S3`<br/>Speedup |
 | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
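
For reference, the notation in the context line above maps onto a ReLU-gated FFN as in the following minimal PyTorch sketch. The gate projection `w_gate` and the tensor shapes are assumptions made for illustration, since the full formula lies outside this hunk.

```python
import torch

def relu_gated_ffn(x, w_gate, w1, w2):
    """Sketch of a ReLU-gated FFN in the notation above:
    s = ReLU(x @ w_gate) are the gating scores (mostly zeros after ProSparse),
    x1 = s * (x @ w1) are the intermediate outputs, and the output is x1 @ w2."""
    s = torch.relu(x @ w_gate)   # gating scores
    x1 = s * (x @ w1)            # element-wise multiplication (the \odot above)
    return x1 @ w2

d_model, d_ff = 8, 32
x = torch.randn(2, d_model)
w_gate = torch.randn(d_model, d_ff)
w1 = torch.randn(d_model, d_ff)
w2 = torch.randn(d_ff, d_model)
print(relu_gated_ffn(x, w_gate, w1, w2).shape)  # torch.Size([2, 8])
```

Whenever an entry of `s` is zero, the matching column of `w1` and row of `w2` contribute nothing to the output; predicting those zeros ahead of time is what lets sparsity-aware engines such as PowerInfer and the `S2`/`S3` kernels in the table below skip work.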
@@ -140,9 +140,11 @@ Please kindly cite using the following BibTeX:
   title={{ProSparse}: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models},
   author={Song, Chenyang and Han, Xu and Zhang, Zhengyan and Hu, Shengding and Shi, Xiyu and Li, Kuai and Chen, Chen and Liu, Zhiyuan and Li, Guangli and Yang, Tao and Sun, Maosong},
   year={2024},
+  journal={arXiv preprint arXiv:2402.13516},
+  url={https://arxiv.org/pdf/2402.13516.pdf}
 }
 ```

 #### Acknowledgments

- The model card is modified from [ReluLLaMA-13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B).
+ The model card is modified from [ReluLLaMA-13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B).
 