sharpenb committed on
Commit
d9f7631
1 Parent(s): 70f8b25

Upload folder using huggingface_hub (#1)


- b022b3448af78fe2ee6cdc3744dc9e6591580ba780bdd5f5df66ff09226f73ff (673ffcdd111b1d7b482ec3c04eb992c8bfdf50ab)
- b017110984f54990e66b9fe0338d2f17cbd17ed07c5e946b29aaec5dd888c015 (dd81eede835734de9fb584615eaa5196cd0263cb)
- 39cc8a54a14ae9b4e748f6ac9a8b1e12a583b33fd4986c2ac28a8696e5e2c93a (2ddbdbe625f896113fad8e509817c24f34663744)

README.md ADDED
@@ -0,0 +1,85 @@
1
+ ---
2
+ thumbnail: "https://assets-global.website-files.com/646b351987a8d8ce158d1940/64ec9e96b4334c0e1ac41504_Logo%20with%20white%20text.svg"
3
+ base_model: llama-moe/LLaMA-MoE-v1-3_5B-2_8
4
+ metrics:
5
+ - memory_disk
6
+ - memory_inference
7
+ - inference_latency
8
+ - inference_throughput
9
+ - inference_CO2_emissions
10
+ - inference_energy_consumption
11
+ tags:
12
+ - pruna-ai
13
+ ---
14
+ <!-- header start -->
15
+ <!-- 200823 -->
16
+ <div style="width: auto; margin-left: auto; margin-right: auto">
17
+ <a href="https://www.pruna.ai/" target="_blank" rel="noopener noreferrer">
18
+ <img src="https://i.imgur.com/eDAlcgk.png" alt="PrunaAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
19
+ </a>
20
+ </div>
21
+ <!-- header end -->
22
+
23
+ [![Twitter](https://img.shields.io/twitter/follow/PrunaAI?style=social)](https://twitter.com/PrunaAI)
24
+ [![GitHub](https://img.shields.io/github/followers/PrunaAI?label=Follow%20%40PrunaAI&style=social)](https://github.com/PrunaAI)
25
+ [![LinkedIn](https://img.shields.io/badge/LinkedIn-Connect-blue)](https://www.linkedin.com/company/93832878/admin/feed/posts/?feedType=following)
26
+ [![Discord](https://img.shields.io/badge/Discord-Join%20Us-blue?style=social&logo=discord)](https://discord.gg/CP4VSgck)
27
+
28
+ # Simply make AI models cheaper, smaller, faster, and greener!
29
+
30
+ - Give a thumbs up if you like this model!
31
+ - Contact us and tell us which model to compress next [here](https://www.pruna.ai/contact).
32
+ - Request access to easily compress your *own* AI models [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
33
+ - Read the documentation to learn more [here](https://pruna-ai-pruna.readthedocs-hosted.com/en/latest/).
34
+ - Join the Pruna AI community on Discord [here](https://discord.gg/CP4VSgck) to share feedback or suggestions, or to get help.
35
+
36
+ ## Results
37
+
38
+ ![image info](./plots.png)
39
+
40
+ **Frequently Asked Questions**
41
+ - ***How does the compression work?*** The model is quantized to 4-bit with bitsandbytes (see the `quantization_config` in `config.json`).
42
+ - ***How does the model quality change?*** The quality of the model output might vary compared to the base model.
43
+ - ***How is the model efficiency evaluated?*** These results were obtained on HARDWARE_NAME with the configuration described in `model/smash_config.json`, after a hardware warmup. The smashed model is compared directly to the original base model. Efficiency results may vary in other settings (e.g. other hardware, image size, batch size, ...). We recommend running the model directly in your use-case conditions to find out whether the smashed model can benefit you.
44
+ - ***What is the model format?*** We use safetensors.
45
+ - ***What calibration data has been used?*** If needed by the compression method, we used WikiText as the calibration data.
46
+ - ***What is the naming convention for Pruna Huggingface models?*** We take the original model name and append "turbo", "tiny", or "green" if the smashed model's measured inference speed, inference memory, or inference energy consumption is less than 90% of the original base model's.
47
+ - ***How to compress my own models?*** You can request premium access to more compression methods and tech support for your specific use-cases [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
48
+ - ***What are "first" metrics?*** Results mentioning "first" are obtained after the first run of the model. The first run might take more memory or be slower than subsequent runs due to CUDA overheads.
49
+ - ***What are "Sync" and "Async" metrics?*** "Sync" metrics are obtained by synchronizing all GPU processes and stopping the measurement when all of them have finished. "Async" metrics are obtained without synchronizing the GPU processes and stop as soon as the model output can be used by the CPU. We provide both since either can be relevant depending on the use-case. We recommend testing the efficiency gains directly in your use-case; a minimal timing sketch is shown after this list.
50
+
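+ As a rough illustration of the "Sync"/"Async" distinction above, the sketch below shows one hypothetical way to time generation on a CUDA GPU with PyTorch. It is not the benchmark Pruna uses (that configuration lives in `model/smash_config.json`); the function name and token count are placeholders.
+ 
+ ```python
+ import time
+ import torch
+ 
+ def time_generation(model, input_ids, max_new_tokens=64, sync=True):
+     """Return the wall-clock latency of a single generate() call, in seconds."""
+     if sync and torch.cuda.is_available():
+         torch.cuda.synchronize()  # finish any pending GPU work before starting the clock
+     start = time.perf_counter()
+     _ = model.generate(input_ids, max_new_tokens=max_new_tokens)
+     if sync and torch.cuda.is_available():
+         torch.cuda.synchronize()  # "Sync": stop only once all GPU kernels have finished
+     # Without the second synchronize, timing stops as soon as the output is usable by the CPU ("Async").
+     return time.perf_counter() - start
+ ```
+ 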
51
+ ## Setup
52
+
53
+ You can run the smashed model with these steps:
54
+
55
+ 0. Check that the requirements from the original repo llama-moe/LLaMA-MoE-v1-3_5B-2_8 are installed. In particular, check the python, cuda, and transformers versions.
56
+ 1. Make sure that you have installed the quantization-related packages.
57
+ ```bash
58
+ pip install transformers accelerate "bitsandbytes>0.37.0"
59
+ ```
60
+ 2. Load & run the model.
61
+ ```python
62
+ from transformers import AutoModelForCausalLM, AutoTokenizer
63
+
64
+
65
+ model = AutoModelForCausalLM.from_pretrained("PrunaAI/llama-moe-LLaMA-MoE-v1-3_5B-2_8-bnb-4bit-smashed", trust_remote_code=True, device_map='auto')
66
+ tokenizer = AutoTokenizer.from_pretrained("llama-moe/LLaMA-MoE-v1-3_5B-2_8")
67
+
68
+ input_ids = tokenizer("What is the color of prunes?,", return_tensors='pt').to(model.device)["input_ids"]
69
+
70
+ outputs = model.generate(input_ids, max_new_tokens=216)
71
+ tokenizer.decode(outputs[0])
72
+ ```
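+ 
+ As an optional sanity check (illustrative only, not part of the Pruna workflow), the built-in `transformers` helper below reports an approximate in-memory size of the quantized model:
+ 
+ ```python
+ # Approximate memory used by the model's parameters and buffers, in bytes.
+ print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
+ ```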
73
+
74
+ ## Configurations
75
+
76
+ The configuration info is in `smash_config.json`.
77
+
78
+ ## Credits & License
79
+
80
+ The license of the smashed model follows the license of the original model. Please check the license of the original model llama-moe/LLaMA-MoE-v1-3_5B-2_8, which provides the base model, before using this model. The license of the `pruna-engine` is [here](https://pypi.org/project/pruna-engine/) on PyPI.
81
+
82
+ ## Want to compress other models?
83
+
84
+ - Contact us and tell us which model to compress next [here](https://www.pruna.ai/contact).
85
+ - Request access to easily compress your own AI models [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
config.json ADDED
@@ -0,0 +1,387 @@
1
+ {
2
+ "_name_or_path": "/ceph/hdd/staff/charpent/.cache/modelsbijycn3y7u7f4q1e",
3
+ "add_weight_norm": false,
4
+ "architectures": [
5
+ "LlamaMoEForCausalLM"
6
+ ],
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_llama_moe.LlamaMoEConfig",
9
+ "AutoModel": "llama-moe/LLaMA-MoE-v1-3_5B-2_8--modeling_llama_moe_hf.LlamaMoEModel",
10
+ "AutoModelForCausalLM": "modeling_llama_moe_hf.LlamaMoEForCausalLM"
11
+ },
12
+ "bos_token_id": 1,
13
+ "calculator_type": "UniversalCalculator",
14
+ "capacity_factor": 1.25,
15
+ "drop_tokens": true,
16
+ "dropped_padding": "zero",
17
+ "eos_token_id": 2,
18
+ "gate_add_noise": true,
19
+ "gate_balance_loss_weight": 0.01,
20
+ "gate_network": "mlp",
21
+ "gate_noise_epsilon": 0.01,
22
+ "gate_type": "TopKBalancedNoisyGate",
23
+ "gate_use_balance": true,
24
+ "gate_use_softmax": true,
25
+ "gates": "mlp",
26
+ "hidden_act": "silu",
27
+ "hidden_size": 4096,
28
+ "initializer_range": 0.02,
29
+ "intermediate_size": 11008,
30
+ "max_position_embeddings": 4096,
31
+ "model_type": "llama_moe",
32
+ "multiply_gate_scores": true,
33
+ "num_attention_heads": 32,
34
+ "num_experts": 8,
35
+ "num_hidden_layers": 32,
36
+ "num_key_value_heads": 32,
37
+ "num_selects": 2,
38
+ "pad_token_id": 0,
39
+ "pretraining_tp": 1,
40
+ "quantization_config": {
41
+ "_load_in_4bit": true,
42
+ "_load_in_8bit": false,
43
+ "bnb_4bit_compute_dtype": "bfloat16",
44
+ "bnb_4bit_quant_storage": "uint8",
45
+ "bnb_4bit_quant_type": "fp4",
46
+ "bnb_4bit_use_double_quant": false,
47
+ "llm_int8_enable_fp32_cpu_offload": false,
48
+ "llm_int8_has_fp16_weight": false,
49
+ "llm_int8_skip_modules": [
50
+ "lm_head"
51
+ ],
52
+ "llm_int8_threshold": 6.0,
53
+ "load_in_4bit": true,
54
+ "load_in_8bit": false,
55
+ "quant_method": "bitsandbytes"
56
+ },
57
+ "rms_norm_eps": 1e-05,
58
+ "rope_scaling": null,
59
+ "score_scale_factor": 4.0,
60
+ "size_experts": [
61
+ [
62
+ 1376,
63
+ 1376,
64
+ 1376,
65
+ 1376,
66
+ 1376,
67
+ 1376,
68
+ 1376,
69
+ 1376
70
+ ],
71
+ [
72
+ 1376,
73
+ 1376,
74
+ 1376,
75
+ 1376,
76
+ 1376,
77
+ 1376,
78
+ 1376,
79
+ 1376
80
+ ],
81
+ [
82
+ 1376,
83
+ 1376,
84
+ 1376,
85
+ 1376,
86
+ 1376,
87
+ 1376,
88
+ 1376,
89
+ 1376
90
+ ],
91
+ [
92
+ 1376,
93
+ 1376,
94
+ 1376,
95
+ 1376,
96
+ 1376,
97
+ 1376,
98
+ 1376,
99
+ 1376
100
+ ],
101
+ [
102
+ 1376,
103
+ 1376,
104
+ 1376,
105
+ 1376,
106
+ 1376,
107
+ 1376,
108
+ 1376,
109
+ 1376
110
+ ],
111
+ [
112
+ 1376,
113
+ 1376,
114
+ 1376,
115
+ 1376,
116
+ 1376,
117
+ 1376,
118
+ 1376,
119
+ 1376
120
+ ],
121
+ [
122
+ 1376,
123
+ 1376,
124
+ 1376,
125
+ 1376,
126
+ 1376,
127
+ 1376,
128
+ 1376,
129
+ 1376
130
+ ],
131
+ [
132
+ 1376,
133
+ 1376,
134
+ 1376,
135
+ 1376,
136
+ 1376,
137
+ 1376,
138
+ 1376,
139
+ 1376
140
+ ],
141
+ [
142
+ 1376,
143
+ 1376,
144
+ 1376,
145
+ 1376,
146
+ 1376,
147
+ 1376,
148
+ 1376,
149
+ 1376
150
+ ],
151
+ [
152
+ 1376,
153
+ 1376,
154
+ 1376,
155
+ 1376,
156
+ 1376,
157
+ 1376,
158
+ 1376,
159
+ 1376
160
+ ],
161
+ [
162
+ 1376,
163
+ 1376,
164
+ 1376,
165
+ 1376,
166
+ 1376,
167
+ 1376,
168
+ 1376,
169
+ 1376
170
+ ],
171
+ [
172
+ 1376,
173
+ 1376,
174
+ 1376,
175
+ 1376,
176
+ 1376,
177
+ 1376,
178
+ 1376,
179
+ 1376
180
+ ],
181
+ [
182
+ 1376,
183
+ 1376,
184
+ 1376,
185
+ 1376,
186
+ 1376,
187
+ 1376,
188
+ 1376,
189
+ 1376
190
+ ],
191
+ [
192
+ 1376,
193
+ 1376,
194
+ 1376,
195
+ 1376,
196
+ 1376,
197
+ 1376,
198
+ 1376,
199
+ 1376
200
+ ],
201
+ [
202
+ 1376,
203
+ 1376,
204
+ 1376,
205
+ 1376,
206
+ 1376,
207
+ 1376,
208
+ 1376,
209
+ 1376
210
+ ],
211
+ [
212
+ 1376,
213
+ 1376,
214
+ 1376,
215
+ 1376,
216
+ 1376,
217
+ 1376,
218
+ 1376,
219
+ 1376
220
+ ],
221
+ [
222
+ 1376,
223
+ 1376,
224
+ 1376,
225
+ 1376,
226
+ 1376,
227
+ 1376,
228
+ 1376,
229
+ 1376
230
+ ],
231
+ [
232
+ 1376,
233
+ 1376,
234
+ 1376,
235
+ 1376,
236
+ 1376,
237
+ 1376,
238
+ 1376,
239
+ 1376
240
+ ],
241
+ [
242
+ 1376,
243
+ 1376,
244
+ 1376,
245
+ 1376,
246
+ 1376,
247
+ 1376,
248
+ 1376,
249
+ 1376
250
+ ],
251
+ [
252
+ 1376,
253
+ 1376,
254
+ 1376,
255
+ 1376,
256
+ 1376,
257
+ 1376,
258
+ 1376,
259
+ 1376
260
+ ],
261
+ [
262
+ 1376,
263
+ 1376,
264
+ 1376,
265
+ 1376,
266
+ 1376,
267
+ 1376,
268
+ 1376,
269
+ 1376
270
+ ],
271
+ [
272
+ 1376,
273
+ 1376,
274
+ 1376,
275
+ 1376,
276
+ 1376,
277
+ 1376,
278
+ 1376,
279
+ 1376
280
+ ],
281
+ [
282
+ 1376,
283
+ 1376,
284
+ 1376,
285
+ 1376,
286
+ 1376,
287
+ 1376,
288
+ 1376,
289
+ 1376
290
+ ],
291
+ [
292
+ 1376,
293
+ 1376,
294
+ 1376,
295
+ 1376,
296
+ 1376,
297
+ 1376,
298
+ 1376,
299
+ 1376
300
+ ],
301
+ [
302
+ 1376,
303
+ 1376,
304
+ 1376,
305
+ 1376,
306
+ 1376,
307
+ 1376,
308
+ 1376,
309
+ 1376
310
+ ],
311
+ [
312
+ 1376,
313
+ 1376,
314
+ 1376,
315
+ 1376,
316
+ 1376,
317
+ 1376,
318
+ 1376,
319
+ 1376
320
+ ],
321
+ [
322
+ 1376,
323
+ 1376,
324
+ 1376,
325
+ 1376,
326
+ 1376,
327
+ 1376,
328
+ 1376,
329
+ 1376
330
+ ],
331
+ [
332
+ 1376,
333
+ 1376,
334
+ 1376,
335
+ 1376,
336
+ 1376,
337
+ 1376,
338
+ 1376,
339
+ 1376
340
+ ],
341
+ [
342
+ 1376,
343
+ 1376,
344
+ 1376,
345
+ 1376,
346
+ 1376,
347
+ 1376,
348
+ 1376,
349
+ 1376
350
+ ],
351
+ [
352
+ 1376,
353
+ 1376,
354
+ 1376,
355
+ 1376,
356
+ 1376,
357
+ 1376,
358
+ 1376,
359
+ 1376
360
+ ],
361
+ [
362
+ 1376,
363
+ 1376,
364
+ 1376,
365
+ 1376,
366
+ 1376,
367
+ 1376,
368
+ 1376,
369
+ 1376
370
+ ],
371
+ [
372
+ 1376,
373
+ 1376,
374
+ 1376,
375
+ 1376,
376
+ 1376,
377
+ 1376,
378
+ 1376,
379
+ 1376
380
+ ]
381
+ ],
382
+ "tie_word_embeddings": false,
383
+ "torch_dtype": "float16",
384
+ "transformers_version": "4.41.2",
385
+ "use_cache": true,
386
+ "vocab_size": 32000
387
+ }
configuration_llama_moe.py ADDED
@@ -0,0 +1,124 @@
1
+ from transformers.configuration_utils import PretrainedConfig
2
+
3
+
4
+ class LlamaMoEConfig(PretrainedConfig):
5
+ model_type = "llama_moe"
6
+ keys_to_ignore_at_inference = ["past_key_values"]
7
+
8
+ def __init__(
9
+ self,
10
+ vocab_size=32000,
11
+ hidden_size=4096,
12
+ intermediate_size=11008,
13
+ num_hidden_layers=32,
14
+ num_attention_heads=32,
15
+ num_key_value_heads=None,
16
+ hidden_act="silu",
17
+ max_position_embeddings=2048,
18
+ initializer_range=0.02,
19
+ rms_norm_eps=1e-6,
20
+ use_cache=True,
21
+ pad_token_id=0,
22
+ bos_token_id=1,
23
+ eos_token_id=2,
24
+ pretraining_tp=1,
25
+ tie_word_embeddings=False,
26
+ rope_scaling=None,
27
+ # -------- moe expert configs --------
28
+ num_experts=16,
29
+ num_selects=4,
30
+ size_experts=None,
31
+ # -------- moe gate configs --------
32
+ gate_type="TopKBalancedNoisyGate",
33
+ gate_network="mlp",
34
+ gate_use_softmax=True,
35
+ gate_use_balance=True,
36
+ gate_balance_loss_weight=1e-2,
37
+ gate_add_noise=True,
38
+ # TopKBalancedNoisyGate
39
+ gate_noise_epsilon=1e-2,
40
+ # -------- moe calculator configs --------
41
+ calculator_type="UniversalCalculator",
42
+ multiply_gate_scores=True,
43
+ score_scale_factor=1.0,
44
+ add_weight_norm=False,
45
+ # SwitchDropTokenCalculator
46
+ drop_tokens=True,
47
+ dropped_padding="zero",
48
+ capacity_factor=1.25,
49
+ **kwargs,
50
+ ):
51
+ self.vocab_size = vocab_size
52
+ self.max_position_embeddings = max_position_embeddings
53
+ self.hidden_size = hidden_size
54
+ self.intermediate_size = intermediate_size
55
+ self.num_hidden_layers = num_hidden_layers
56
+ self.num_attention_heads = num_attention_heads
57
+ self.hidden_act = hidden_act
58
+ self.initializer_range = initializer_range
59
+ self.rms_norm_eps = rms_norm_eps
60
+ self.pretraining_tp = pretraining_tp
61
+ self.use_cache = use_cache
62
+ self.rope_scaling = rope_scaling
63
+ self._rope_scaling_validation()
64
+
65
+ self.num_experts = num_experts
66
+ self.num_selects = num_selects
67
+ self.size_experts = size_experts
68
+
69
+ self.gate_type = gate_type
70
+ self.gate_network = gate_network
71
+ self.gate_use_softmax = gate_use_softmax
72
+ self.gate_use_balance = gate_use_balance
73
+ self.gate_balance_loss_weight = gate_balance_loss_weight
74
+ self.gate_add_noise = gate_add_noise
75
+ self.gate_noise_epsilon = gate_noise_epsilon
76
+
77
+ self.calculator_type = calculator_type
78
+ self.multiply_gate_scores = multiply_gate_scores
79
+ self.score_scale_factor = score_scale_factor
80
+ self.add_weight_norm = add_weight_norm
81
+ self.drop_tokens = drop_tokens
82
+ self.dropped_padding = dropped_padding
83
+ self.capacity_factor = capacity_factor
84
+
85
+ # for backward compatibility
86
+ if num_key_value_heads is None:
87
+ num_key_value_heads = num_attention_heads
88
+
89
+ self.num_key_value_heads = num_key_value_heads
90
+
91
+ super().__init__(
92
+ pad_token_id=pad_token_id,
93
+ bos_token_id=bos_token_id,
94
+ eos_token_id=eos_token_id,
95
+ tie_word_embeddings=tie_word_embeddings,
96
+ **kwargs,
97
+ )
98
+
99
+ def _rope_scaling_validation(self):
100
+ """
101
+ Validate the `rope_scaling` configuration.
102
+ """
103
+ if self.rope_scaling is None:
104
+ return
105
+
106
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
107
+ raise ValueError(
108
+ "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
109
+ f"got {self.rope_scaling}"
110
+ )
111
+ rope_scaling_type = self.rope_scaling.get("type", None)
112
+ rope_scaling_factor = self.rope_scaling.get("factor", None)
113
+ if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
114
+ raise ValueError(
115
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
116
+ )
117
+ if (
118
+ rope_scaling_factor is None
119
+ or not isinstance(rope_scaling_factor, float)
120
+ or rope_scaling_factor <= 1.0
121
+ ):
122
+ raise ValueError(
123
+ f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}"
124
+ )
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "pad_token_id": 0,
6
+ "transformers_version": "4.41.2"
7
+ }
model-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8b9eec6f276c67605104486fafcc7fca18357819a9787fbe9ae8b6ea892d0406
3
+ size 4992709848
model-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c5a5694eb83c3acc8c7272584b317816a62e149a8a9d175cfa43b3aedec6633f
3
+ size 4989846240
model-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:67837a8cc010f5315999f4a166c8ff008badf4eb6d169f8e3e6bae3b44723e3a
3
+ size 408709168
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling_llama_moe_hf.py ADDED
@@ -0,0 +1,1664 @@
1
+ import math
2
+ import warnings
3
+ from dataclasses import dataclass
4
+ from typing import Optional, Tuple
5
+
6
+ import torch
7
+ import torch.utils.checkpoint
8
+ import torch.nn as nn
9
+ import torch.nn.functional as F
10
+ from torch.distributions.normal import Normal
11
+ from transformers.modeling_outputs import (
12
+ CausalLMOutputWithPast,
13
+ )
14
+ from transformers.modeling_utils import PreTrainedModel
15
+ from transformers.activations import ACT2FN
16
+ from transformers.utils import ModelOutput, logging
17
+
18
+ from .configuration_llama_moe import LlamaMoEConfig
19
+
20
+ logger = logging.get_logger(__name__)
21
+
22
+ _CONFIG_FOR_DOC = "LlamaMoEConfig"
23
+
24
+
25
+ @dataclass
26
+ class CalculatorOutput(ModelOutput):
27
+ hidden_states: Optional[torch.FloatTensor] = None
28
+ num_dropped_tokens: Optional[int] = None
29
+
30
+
31
+ @dataclass
32
+ class BaseMoEModelOutputWithPast(ModelOutput):
33
+ """
34
+ Args:
35
+ num_dropped_tokens: layer idx to the number of dropped tokens
36
+ """
37
+
38
+ last_hidden_state: torch.FloatTensor = None
39
+ past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
40
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
41
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
42
+ balance_loss: Optional[float] = None
43
+ num_dropped_tokens: Optional[Tuple[torch.Tensor]] = None
44
+ gate_load: Optional[Tuple[list]] = None
45
+ gate_importance: Optional[Tuple[list]] = None
46
+
47
+
48
+ @dataclass
49
+ class MoECausalLMOutputWithPast(CausalLMOutputWithPast):
50
+ balance_loss: Optional[float] = None
51
+ num_dropped_tokens: Optional[Tuple[int]] = None
52
+ gate_load: Optional[Tuple[list[torch.Tensor]]] = None
53
+ gate_importance: Optional[Tuple[list[torch.Tensor]]] = None
54
+
55
+
56
+ @dataclass
57
+ class MoEMlpOutput(ModelOutput):
58
+ hidden_states: Optional[torch.FloatTensor] = None
59
+ balance_loss: Optional[torch.FloatTensor] = None
60
+ num_dropped_tokens: Optional[int] = None
61
+ gate_load: Optional[list] = None
62
+ gate_importance: Optional[list] = None
63
+
64
+
65
+ def _make_causal_mask(
66
+ input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
67
+ ):
68
+ """
69
+ Make causal mask used for bi-directional self-attention.
70
+ """
71
+ bsz, tgt_len = input_ids_shape
72
+ mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
73
+ mask_cond = torch.arange(mask.size(-1), device=device)
74
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
75
+ mask = mask.to(dtype)
76
+
77
+ if past_key_values_length > 0:
78
+ mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
79
+ return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
80
+
81
+
82
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
83
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
84
+ """
85
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
86
+ """
87
+ bsz, src_len = mask.size()
88
+ tgt_len = tgt_len if tgt_len is not None else src_len
89
+
90
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
91
+
92
+ inverted_mask = 1.0 - expanded_mask
93
+
94
+ return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
95
+
96
+
97
+ class LlamaRMSNorm(nn.Module):
98
+ def __init__(self, hidden_size, eps=1e-6):
99
+ """
100
+ LlamaRMSNorm is equivalent to T5LayerNorm
101
+ """
102
+ super().__init__()
103
+ self.weight = nn.Parameter(torch.ones(hidden_size))
104
+ self.variance_epsilon = eps
105
+
106
+ def forward(self, hidden_states):
107
+ input_dtype = hidden_states.dtype
108
+ hidden_states = hidden_states.to(torch.float32)
109
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
110
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
111
+ return self.weight * hidden_states.to(input_dtype)
112
+
113
+
114
+ class LlamaRotaryEmbedding(torch.nn.Module):
115
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
116
+ super().__init__()
117
+
118
+ self.dim = dim
119
+ self.max_position_embeddings = max_position_embeddings
120
+ self.base = base
121
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
122
+ self.register_buffer("inv_freq", inv_freq)
123
+
124
+ # Build here to make `torch.jit.trace` work.
125
+ self._set_cos_sin_cache(
126
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
127
+ )
128
+
129
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
130
+ self.max_seq_len_cached = seq_len
131
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
132
+
133
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
134
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
135
+ emb = torch.cat((freqs, freqs), dim=-1)
136
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
137
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
138
+
139
+ def forward(self, x, seq_len=None):
140
+ # x: [bs, num_attention_heads, seq_len, head_size]
141
+ if seq_len > self.max_seq_len_cached:
142
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
143
+
144
+ return (
145
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
146
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
147
+ )
148
+
149
+
150
+ class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
151
+ """LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
152
+
153
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
154
+ self.scaling_factor = scaling_factor
155
+ super().__init__(dim, max_position_embeddings, base, device)
156
+
157
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
158
+ self.max_seq_len_cached = seq_len
159
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
160
+ t = t / self.scaling_factor
161
+
162
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
163
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
164
+ emb = torch.cat((freqs, freqs), dim=-1)
165
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
166
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
167
+
168
+
169
+ class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
170
+ """LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
171
+
172
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
173
+ self.scaling_factor = scaling_factor
174
+ super().__init__(dim, max_position_embeddings, base, device)
175
+
176
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
177
+ self.max_seq_len_cached = seq_len
178
+
179
+ if seq_len > self.max_position_embeddings:
180
+ base = self.base * (
181
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
182
+ ) ** (self.dim / (self.dim - 2))
183
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
184
+ self.register_buffer("inv_freq", inv_freq)
185
+
186
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
187
+
188
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
189
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
190
+ emb = torch.cat((freqs, freqs), dim=-1)
191
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
192
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
193
+
194
+
195
+ def rotate_half(x):
196
+ """Rotates half the hidden dims of the input."""
197
+ x1 = x[..., : x.shape[-1] // 2]
198
+ x2 = x[..., x.shape[-1] // 2 :]
199
+ return torch.cat((-x2, x1), dim=-1)
200
+
201
+
202
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
203
+ # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
204
+ cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
205
+ sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
206
+ cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
207
+ sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
208
+ q_embed = (q * cos) + (rotate_half(q) * sin)
209
+ k_embed = (k * cos) + (rotate_half(k) * sin)
210
+ return q_embed, k_embed
211
+
212
+
213
+ class LlamaMLP(nn.Module):
214
+ def __init__(self, config):
215
+ super().__init__()
216
+ self.pretraining_tp = config.pretraining_tp
217
+ self.hidden_size = config.hidden_size
218
+ self.intermediate_size = config.intermediate_size
219
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
220
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
221
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
222
+ self.act_fn = ACT2FN[config.hidden_act]
223
+
224
+ def forward(self, x):
225
+ if self.pretraining_tp > 1:
226
+ slice = self.intermediate_size // self.pretraining_tp
227
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
228
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
229
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
230
+
231
+ gate_proj = torch.cat([F.linear(x, gate_proj_slices[i]) for i in range(self.pretraining_tp)], dim=-1)
232
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.pretraining_tp)], dim=-1)
233
+
234
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
235
+ down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.pretraining_tp)]
236
+ down_proj = sum(down_proj)
237
+ else:
238
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
239
+
240
+ return down_proj
241
+
242
+
243
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
244
+ """
245
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
246
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
247
+ """
248
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
249
+ if n_rep == 1:
250
+ return hidden_states
251
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
252
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
253
+
254
+
255
+ class LlamaAttention(nn.Module):
256
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
257
+
258
+ def __init__(self, config: LlamaMoEConfig):
259
+ super().__init__()
260
+ self.config = config
261
+ self.hidden_size = config.hidden_size
262
+ self.num_heads = config.num_attention_heads
263
+ self.head_dim = self.hidden_size // self.num_heads
264
+ self.num_key_value_heads = config.num_key_value_heads
265
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
266
+ self.pretraining_tp = config.pretraining_tp
267
+ self.max_position_embeddings = config.max_position_embeddings
268
+
269
+ if (self.head_dim * self.num_heads) != self.hidden_size:
270
+ raise ValueError(
271
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
272
+ f" and `num_heads`: {self.num_heads})."
273
+ )
274
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
275
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
276
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
277
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
278
+ self._init_rope()
279
+
280
+ def _init_rope(self):
281
+ if self.config.rope_scaling is None:
282
+ self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
283
+ else:
284
+ scaling_type = self.config.rope_scaling["type"]
285
+ scaling_factor = self.config.rope_scaling["factor"]
286
+ if scaling_type == "linear":
287
+ self.rotary_emb = LlamaLinearScalingRotaryEmbedding(
288
+ self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor
289
+ )
290
+ elif scaling_type == "dynamic":
291
+ self.rotary_emb = LlamaDynamicNTKScalingRotaryEmbedding(
292
+ self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor
293
+ )
294
+ else:
295
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
296
+
297
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
298
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
299
+
300
+ def forward(
301
+ self,
302
+ hidden_states: torch.Tensor,
303
+ attention_mask: Optional[torch.Tensor] = None,
304
+ position_ids: Optional[torch.LongTensor] = None,
305
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
306
+ output_attentions: bool = False,
307
+ use_cache: bool = False,
308
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
309
+ bsz, q_len, _ = hidden_states.size()
310
+
311
+ if self.pretraining_tp > 1:
312
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.pretraining_tp
313
+ query_slices = self.q_proj.weight.split((self.num_heads * self.head_dim) // self.pretraining_tp, dim=0)
314
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
315
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
316
+
317
+ query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
318
+ query_states = torch.cat(query_states, dim=-1)
319
+
320
+ key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.pretraining_tp)]
321
+ key_states = torch.cat(key_states, dim=-1)
322
+
323
+ value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.pretraining_tp)]
324
+ value_states = torch.cat(value_states, dim=-1)
325
+
326
+ else:
327
+ query_states = self.q_proj(hidden_states)
328
+ key_states = self.k_proj(hidden_states)
329
+ value_states = self.v_proj(hidden_states)
330
+
331
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
332
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
333
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
334
+
335
+ kv_seq_len = key_states.shape[-2]
336
+ if past_key_value is not None:
337
+ kv_seq_len += past_key_value[0].shape[-2]
338
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
339
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
340
+
341
+ if past_key_value is not None:
342
+ # reuse k, v, self_attention
343
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
344
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
345
+
346
+ past_key_value = (key_states, value_states) if use_cache else None
347
+
348
+ # repeat k/v heads if n_kv_heads < n_heads
349
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
350
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
351
+
352
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
353
+
354
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
355
+ raise ValueError(
356
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
357
+ f" {attn_weights.size()}"
358
+ )
359
+
360
+ if attention_mask is not None:
361
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
362
+ raise ValueError(
363
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
364
+ )
365
+ attn_weights = attn_weights + attention_mask
366
+
367
+ # upcast attention to fp32
368
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
369
+ attn_output = torch.matmul(attn_weights, value_states)
370
+
371
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
372
+ raise ValueError(
373
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
374
+ f" {attn_output.size()}"
375
+ )
376
+
377
+ attn_output = attn_output.transpose(1, 2).contiguous()
378
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
379
+
380
+ if self.pretraining_tp > 1:
381
+ attn_output = attn_output.split(self.hidden_size // self.pretraining_tp, dim=2)
382
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.pretraining_tp, dim=1)
383
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.pretraining_tp)])
384
+ else:
385
+ attn_output = self.o_proj(attn_output)
386
+
387
+ if not output_attentions:
388
+ attn_weights = None
389
+
390
+ return attn_output, attn_weights, past_key_value
391
+
392
+
393
+ class TopKBalancedNoisyGate(nn.Module):
394
+ def __init__(
395
+ self,
396
+ input_size,
397
+ num_experts,
398
+ num_selects,
399
+ gate_network="mlp",
400
+ use_softmax=True,
401
+ use_balance=True,
402
+ balance_loss_weight=1e-2,
403
+ add_noise=True,
404
+ noise_epsilon=1e-2,
405
+ ):
406
+ super(TopKBalancedNoisyGate, self).__init__()
407
+ assert num_selects <= num_experts
408
+ self.input_size = input_size
409
+ self.num_experts = num_experts
410
+ self.num_selects = num_selects
411
+
412
+ self.gate_network_type = gate_network
413
+ self.gate_network = self.get_gate_network(gate_network, input_size, num_experts)
414
+
415
+ self.use_softmax = use_softmax
416
+ self.softmax = nn.Softmax(1)
417
+
418
+ self.use_balance = use_balance
419
+ self.balance_loss_weight = balance_loss_weight
420
+
421
+ # add_noise
422
+ self.add_noise = add_noise
423
+ self.noise_epsilon = noise_epsilon
424
+ self.warned = False
425
+ if self.add_noise:
426
+ self.weight_noise = nn.Linear(input_size, num_experts, bias=False)
427
+ self.weight_noise.weight.data = torch.zeros(
428
+ (num_experts, input_size),
429
+ requires_grad=True,
430
+ device=self.weight_noise.weight.data.device,
431
+ dtype=self.weight_noise.weight.data.dtype,
432
+ )
433
+ self.mean = 0.0
434
+ self.std = 1.0
435
+ self.normal = Normal(self.mean, self.std)
436
+ self.softplus = nn.Softplus()
437
+
438
+ self.reset_parameters()
439
+
440
+ def get_gate_network(self, gate_type, input_size, num_experts):
441
+ gate_type = gate_type.lower()
442
+
443
+ if gate_type == "linear":
444
+ gate_network = nn.Linear(input_size, num_experts, bias=False)
445
+ nn.init.zeros_(gate_network.weight)
446
+ elif gate_type == "mlp":
447
+ gate_network = torch.nn.Sequential(
448
+ torch.nn.Linear(input_size, num_experts, bias=False),
449
+ torch.nn.Tanh(),
450
+ torch.nn.Linear(num_experts, num_experts, bias=False),
451
+ )
452
+ else:
453
+ raise ValueError(f'Unexpected gate_type: {gate_type}.')
454
+
455
+ return gate_network
456
+
457
+ def reset_gate_network(self):
458
+ if "gate_network_type" not in vars(self):
459
+ raise KeyError(f"{type(self)} does not have a gate network.")
460
+ else:
461
+ self.gate_network = self.get_gate_network(
462
+ self.gate_network_type, self.input_size, self.num_experts
463
+ )
464
+
465
+ def reset_parameters(self):
466
+ if self.add_noise:
467
+ nn.init.zeros_(self.weight_noise.weight)
468
+ # nn.init.zeros_(self.weight_noise)
469
+
470
+ def cv_squared(self, x, eps=1e-10):
471
+ """The squared coefficient of variation of a sample.
472
+ Useful as a loss to encourage a positive distribution to be more uniform.
473
+ Epsilons added for numerical stability.
474
+ Returns 0 for an empty Tensor.
475
+ Args:
476
+ x: a `Tensor`.
477
+ Returns:
478
+ a `Scalar`.
479
+ """
480
+ if x.shape[0] == 1:
481
+ return torch.tensor(0.0, device=x.device)
482
+ return x.float().var() / (x.float().mean() ** 2 + eps)
483
+
484
+ def forward(self, x):
485
+ logits_gate = self.gate_network(x)
486
+ if self.training and self.add_noise:
487
+ noise_mm = self.weight_noise(x)
488
+ noise_control = self.softplus(noise_mm) + self.noise_epsilon
489
+ logits_noise = torch.randn_like(logits_gate) * noise_control
490
+ logits = logits_gate + logits_noise
491
+ else:
492
+ logits = logits_gate
493
+
494
+ top_logits, top_indices = logits.topk(min(self.num_selects + 1, self.num_experts), dim=1)  # select and sort the top k+1 logits
495
+ top_k_logits = top_logits[:, :self.num_selects]
496
+ top_k_indices = top_indices[:, :self.num_selects]
497
+ top_k_scores = self.softmax(top_k_logits.to(torch.float32)) if self.use_softmax else top_k_logits
498
+ top_k_scores = top_k_scores.to(logits.dtype)
499
+
500
+ zeros = torch.zeros_like(logits, requires_grad=True, device=logits.device)
501
+ scores_filtered = zeros.scatter(dim=1, index=top_k_indices, src=top_k_scores) # shape(batch_size, num_experts)
502
+ importance = scores_filtered.sum(0) # shape(num_experts)
503
+
504
+ if self.training:
505
+ if self.add_noise and self.num_selects != self.num_experts:
506
+ batch_size = top_logits.size(0)
507
+ m = top_logits.size(1)
508
+ top_values_flat = top_logits.flatten()
509
+ threshold_positions_if_in = torch.arange(batch_size, device=x.device) * m + self.num_selects
510
+ threshold_if_in = torch.unsqueeze(torch.gather(top_values_flat, 0, threshold_positions_if_in), 1)
511
+ is_in = torch.gt(logits_noise, threshold_if_in)
512
+ threshold_positions_if_out = threshold_positions_if_in - 1
513
+ threshold_if_out = torch.unsqueeze(torch.gather(top_values_flat, 0, threshold_positions_if_out), 1)
514
+ # is each value currently in the top k.
515
+ prob_if_in = self.normal.cdf((logits_gate - threshold_if_in) / noise_control)
516
+ prob_if_out = self.normal.cdf((logits_gate - threshold_if_out) / noise_control)
517
+ prob = torch.where(is_in, prob_if_in, prob_if_out)
518
+ load = prob.sum(0)
519
+ else:
520
+ load = (scores_filtered > 0).sum(0)
521
+ if not self.add_noise and not self.warned:
522
+ warnings.warn('Gradient-trackable implementation for load calculation is only available when "add_noise=True". '
523
+ 'Training without noise will block the gradient from "load" path and lead to inconsistency in optimization objectives.')
524
+ self.warned = True
525
+ else:
526
+ load = (scores_filtered > 0).sum(0)
527
+
528
+ if self.use_balance:
529
+ balance_loss = self.cv_squared(importance) + self.cv_squared(load)
530
+ balance_loss *= self.balance_loss_weight
531
+ else:
532
+ balance_loss = torch.tensor(-100.0, device=x.device)
533
+
534
+ return {
535
+ "topK_indices": top_k_indices,
536
+ "topK_scores": top_k_scores,
537
+ "balance_loss": balance_loss,
538
+ "load": load,
539
+ "importance": importance,
540
+ }
541
+
542
+
543
+ class LinearGLUExperts(nn.Module):
544
+ """
545
+ Modified from transformers.models.llama.modeling_llama.LlamaMLP
546
+ """
547
+
548
+ __constants__ = [
549
+ "bias",
550
+ "in_features",
551
+ "hidden_features",
552
+ "out_features",
553
+ "hidden_act",
554
+ "num_experts",
555
+ "size_experts",
556
+ ]
557
+
558
+ def __init__(
559
+ self,
560
+ in_features,
561
+ hidden_features,
562
+ out_features,
563
+ hidden_act,
564
+ num_experts,
565
+ size_experts=None,
566
+ bias=True,
567
+ device=None,
568
+ dtype=None,
569
+ ):
570
+ factory_kwargs = {"device": device, "dtype": dtype}
571
+ super(LinearGLUExperts, self).__init__()
572
+ self.in_features = in_features
573
+ self.hidden_features = hidden_features
574
+ self.out_features = out_features
575
+ self.hidden_act = hidden_act
576
+ self.num_experts = num_experts
577
+
578
+ if size_experts is None:
579
+ # all experts share the same number of hidden neurons
580
+ assert hidden_features % num_experts == 0
581
+ size_per_expert = hidden_features // num_experts
582
+ size_experts = [size_per_expert for _ in range(num_experts)]
583
+ else:
584
+ # use specified expert sizes
585
+ assert (
586
+ len(size_experts) == num_experts
587
+ and sum(size_experts) == hidden_features
588
+ )
589
+ self.size_experts = size_experts
590
+
591
+ self.act_fn = ACT2FN[hidden_act]
592
+
593
+ self.weight_gate = nn.ParameterList()
594
+ self.weight_up = nn.ParameterList()
595
+ self.weight_down = nn.ParameterList()
596
+
597
+ for i in range(num_experts):
598
+ # this matrix will be transposed when performing linear forwarding
599
+ this_expert_weight_gate = nn.Parameter(
600
+ torch.empty((size_experts[i], in_features), **factory_kwargs)
601
+ )
602
+ # this matrix will be transposed when performing linear forwarding
603
+ this_expert_weight_up = nn.Parameter(
604
+ torch.empty((size_experts[i], in_features), **factory_kwargs)
605
+ )
606
+ # this matrix will be transposed when performing linear forwarding
607
+ this_expert_weight_down = nn.Parameter(
608
+ torch.empty((out_features, size_experts[i]), **factory_kwargs)
609
+ )
610
+ self.weight_gate.append(this_expert_weight_gate)
611
+ self.weight_up.append(this_expert_weight_up)
612
+ self.weight_down.append(this_expert_weight_down)
613
+
614
+ if bias:
615
+ self.bias_gate = nn.ParameterList()
616
+ self.bias_up = nn.ParameterList()
617
+ self.bias_down = nn.ParameterList()
618
+
619
+ for i in range(num_experts):
620
+ this_expert_bias_gate = nn.Parameter(
621
+ torch.empty((size_experts[i],), **factory_kwargs)
622
+ )
623
+ this_expert_bias_up = nn.Parameter(
624
+ torch.empty((size_experts[i],), **factory_kwargs)
625
+ )
626
+ this_expert_bias_down = nn.Parameter(
627
+ torch.empty((out_features,), **factory_kwargs)
628
+ )
629
+ self.bias_gate.append(this_expert_bias_gate)
630
+ self.bias_up.append(this_expert_bias_up)
631
+ self.bias_down.append(this_expert_bias_down)
632
+ else:
633
+ self.register_parameter("bias_gate", None)
634
+ self.register_parameter("bias_up", None)
635
+ self.register_parameter("bias_down", None)
636
+
637
+ self.reset_parameters()
638
+
639
+ def reset_parameters(self):
640
+ for i in range(self.num_experts):
641
+ nn.init.kaiming_uniform_(self.weight_gate[i], a=math.sqrt(5))
642
+ nn.init.kaiming_uniform_(self.weight_up[i], a=math.sqrt(5))
643
+ nn.init.kaiming_uniform_(self.weight_down[i], a=math.sqrt(5))
644
+ if self.bias_gate is not None:
645
+ fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight_gate[i])
646
+ bound = 1 / math.sqrt(fan_in)
647
+ nn.init.uniform_(self.bias_gate[i], -bound, bound)
648
+ if self.bias_up is not None:
649
+ fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight_up[i])
650
+ bound = 1 / math.sqrt(fan_in)
651
+ nn.init.uniform_(self.bias_up[i], -bound, bound)
652
+ if self.bias_down is not None:
653
+ fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight_down[i])
654
+ bound = 1 / math.sqrt(fan_in)
655
+ nn.init.uniform_(self.bias_down[i], -bound, bound)
656
+
657
+ def forward(self, input, i):
658
+ gate = self.act_fn(
659
+ F.linear(
660
+ input,
661
+ self.weight_gate[i],
662
+ self.bias_gate[i] if self.bias_gate is not None else None,
663
+ )
664
+ )
665
+ up = F.linear(
666
+ input,
667
+ self.weight_up[i],
668
+ self.bias_up[i] if self.bias_up is not None else None,
669
+ )
670
+ down = F.linear(
671
+ gate * up,
672
+ self.weight_down[i],
673
+ self.bias_down[i] if self.bias_down is not None else None,
674
+ )
675
+ return down
676
+
677
+ def extra_repr(self):
678
+ return (
679
+ "in_features={}, hidden_features={}, out_features={}, hidden_act={},"
680
+ " num_experts={}, size_experts={}, bias={}".format(
681
+ self.in_features,
682
+ self.hidden_features,
683
+ self.out_features,
684
+ self.hidden_act,
685
+ self.num_experts,
686
+ self.size_experts,
687
+ self.bias_gate is not None,
688
+ )
689
+ )
690
+
691
+
692
+ class UniversalCalculator(nn.Module):
693
+ def __init__(
694
+ self,
695
+ experts: LinearGLUExperts,
696
+ multiply_gate_scores=True,
697
+ score_scale_factor=1.0,
698
+ add_weight_norm: bool = False,
699
+ ):
700
+ super(UniversalCalculator, self).__init__()
701
+ self.experts = experts
702
+ # TODO (zhutong): use vmap to boost the training efficiency
703
+ # self.experts_vmap = torch.vmap(self.experts)
704
+ self.multiply_gate_scores = multiply_gate_scores
705
+ self.score_scale_factor = score_scale_factor
706
+ self.num_experts = experts.num_experts
707
+ self.mlp_norm = None
708
+ if multiply_gate_scores and add_weight_norm:
709
+ raise NotImplementedError
710
+
711
+ def reset_experts(self):
712
+ self.experts.reset_parameters()
713
+
714
+ def forward(
715
+ self, x, topK_indices, topK_scores, expert_batch_size=None, **kwargs
716
+ ) -> CalculatorOutput:
717
+ batch_size = topK_indices.size(0) # topK_indices: (bsz*seq_len, num_selects)
718
+ num_selects = topK_indices.size(1)
719
+ topK_indices = topK_indices.flatten() # shape(batch_size*num_selects)
720
+ topK_scores = topK_scores.flatten() # shape(batch_size*num_selects)
721
+ batch_indices = torch.arange(
722
+ batch_size, device=topK_scores.device
723
+ ).repeat_interleave(num_selects)
724
+
725
+ _, index_sorted_topK_indices = topK_indices.sort(0)
726
+
727
+ sorted_topK_scores = topK_scores.index_select(0, index_sorted_topK_indices)
728
+ sorted_batch_indices = batch_indices.index_select(0, index_sorted_topK_indices)
729
+
730
+ if expert_batch_size is None:
731
+ expert_batch_size = topK_indices.bincount(
732
+ minlength=self.num_experts
733
+ ).tolist()
734
+
735
+ sorted_x = x.index_select(0, sorted_batch_indices)
736
+ split_x = torch.split(sorted_x, expert_batch_size, dim=0)
737
+
738
+ expert_outputs = [
739
+ self.experts(split_x[i], i)
740
+ for i in range(self.num_experts)
741
+ if split_x[i].shape[0] > 0
742
+ ]
743
+
744
+ # (bsz*seq_len*num_selects, hidden_size)
745
+ cat_expert_outputs = torch.cat(expert_outputs, 0)
746
+ output_dim = cat_expert_outputs.size(1)
747
+ if self.multiply_gate_scores:
748
+ if self.mlp_norm is None:
749
+ cat_expert_outputs = torch.mul(
750
+ cat_expert_outputs,
751
+ sorted_topK_scores.reshape(-1, 1) * self.score_scale_factor,
752
+ )
753
+ # cat_expert_outputs = torch.mul(cat_expert_outputs, sorted_topK_scores.reshape(-1, 1) * 1.0)
754
+ else:
755
+ cat_expert_outputs = torch.mul(
756
+ cat_expert_outputs, sorted_topK_scores.reshape(-1, 1)
757
+ )
758
+ cat_expert_outputs = self.mlp_norm(cat_expert_outputs)
759
+
760
+ zeros = torch.zeros(
761
+ (batch_size, output_dim),
762
+ device=cat_expert_outputs.device,
763
+ dtype=cat_expert_outputs.dtype,
764
+ )
765
+ y = zeros.index_add(0, sorted_batch_indices, cat_expert_outputs)
766
+
767
+ return CalculatorOutput(hidden_states=y, num_dropped_tokens=torch.tensor(-1.0))
768
+
769
+
770
+ class BaseMoELayer(nn.Module):
771
+ def __init__(self):
772
+ super(BaseMoELayer, self).__init__()
773
+
774
+ self.gate: TopKBalancedNoisyGate
775
+ self.calculator: UniversalCalculator
776
+
777
+ def _create_gate(self, **kwargs):
778
+ self.gate_type = kwargs.get("gate_type", "TopKBalancedNoisyGate")
779
+
780
+ if self.gate_type == "TopKBalancedNoisyGate": # noisy gate
781
+ self.gate = TopKBalancedNoisyGate(
782
+ self.input_size,
783
+ self.num_experts,
784
+ self.num_selects,
785
+ gate_network=kwargs.get("gate_network", "mlp"),
786
+ use_softmax=kwargs.get("gate_use_softmax", True),
787
+ use_balance=kwargs.get("gate_use_balance", True),
788
+ balance_loss_weight=kwargs.get("gate_balance_loss_weight", 1e-2),
789
+ add_noise=kwargs.get("gate_add_noise", True),
790
+ noise_epsilon=kwargs.get("gate_noise_epsilon", 1e-2),
791
+ )
792
+ else:
793
+ raise NotImplementedError
794
+
795
+ def _create_calculator(self, experts, **kwargs):
796
+ self.calculator_type = kwargs.get("calculator_type", "UniversalCalculator")
797
+
798
+ if self.calculator_type == "UniversalCalculator": # top K calculator
799
+ self.calculator = UniversalCalculator(
800
+ experts,
801
+ multiply_gate_scores=kwargs.get("multiply_gate_scores", True),
802
+ score_scale_factor=kwargs.get("score_scale_factor", 1.0),
803
+ add_weight_norm=kwargs.get("add_weight_norm", False),
804
+ )
805
+ else:
806
+ raise NotImplementedError
807
+
808
+ def forward(self, x) -> MoEMlpOutput:
809
+ original_shape = x.shape[:-1]
810
+ x = x.reshape(-1, self.input_size)
811
+ gate_outputs: dict = self.gate(x)
812
+ calc_outs: CalculatorOutput = self.calculator(x, **gate_outputs)
813
+ y = calc_outs.hidden_states
814
+ y = y.reshape(original_shape + (self.output_size,))
815
+
816
+ return MoEMlpOutput(
817
+ hidden_states=y,
818
+ balance_loss=gate_outputs.get("balance_loss"),
819
+ num_dropped_tokens=calc_outs.num_dropped_tokens,
820
+ gate_load=gate_outputs.get("load", torch.tensor(-1)),
821
+ gate_importance=gate_outputs.get("importance", torch.tensor(-1)),
822
+ )
823
+
824
+ def set_num_selects(self, num_selects):
825
+ if "num_selects" not in vars(self.gate):
826
+ raise KeyError(f'{self.gate_type} does not have a key named "num_selects".')
827
+ elif num_selects > self.gate.num_experts:
828
+ raise ValueError(
829
+ 'The value of "num_selects" must satisfy "num_selects <= num_experts"!'
830
+ )
831
+ elif self.gate_type in ("SwitchBalancedGate",):
832
+ raise ValueError(
833
+ f"{self.gate_type} doesn't support manually setting num_selects."
834
+ )
835
+ else:
836
+ self.num_selects = num_selects
837
+ self.gate.num_selects = num_selects
838
+
839
+ def set_gate_use_softmax(self, use_softmax):
840
+ if "use_softmax" not in vars(self.gate):
841
+ raise KeyError(f'{self.gate_type} does not have a key named "use_softmax".')
842
+ else:
843
+ self.gate.use_softmax = use_softmax
844
+
845
+ def set_gate_use_balance(self, use_balance):
846
+ if "use_balance" not in vars(self.gate):
847
+ raise KeyError(f'{self.gate_type} does not have a key named "use_balance".')
848
+ else:
849
+ self.gate.use_balance = use_balance
850
+
851
+ def set_gate_balance_loss_weight(self, balance_loss_weight):
852
+ if "balance_loss_weight" not in vars(self.gate):
853
+ raise KeyError(
854
+ f'{self.gate_type} does not have a key named "balance_loss_weight".'
855
+ )
856
+ else:
857
+ self.gate.balance_loss_weight = balance_loss_weight
858
+
859
+ def set_gate_add_noise(self, add_noise):
860
+ if "add_noise" not in vars(self.gate):
861
+ raise KeyError(f'{self.gate_type} does not have a key named "add_noise".')
862
+ else:
863
+ self.gate.add_noise = add_noise
864
+
865
+ def set_gate_noise_epsilon(self, noise_epsilon):
866
+ if "noise_epsilon" not in vars(self.gate):
867
+ raise KeyError(
868
+ f'{self.gate_type} does not have a key named "noise_epsilon".'
869
+ )
870
+ else:
871
+ self.gate.noise_epsilon = noise_epsilon
872
+
873
+     def set_calculator_multiply_gate_scores(self, multiply_gate_scores):
+         if "multiply_gate_scores" not in vars(self.calculator):
+             raise KeyError(
+                 f'{self.calculator_type} does not have a key named "multiply_gate_scores".'
+             )
+         else:
+             self.calculator.multiply_gate_scores = multiply_gate_scores
+
+     def set_calculator_score_scale_factor(self, score_scale_factor):
+         if "score_scale_factor" not in vars(self.calculator):
+             raise KeyError(
+                 f'{self.calculator_type} does not have a key named "score_scale_factor".'
+             )
+         else:
+             self.calculator.score_scale_factor = score_scale_factor
+
+     def set_calculator_drop_tokens(self, drop_tokens):
+         if "drop_tokens" not in vars(self.calculator):
+             raise KeyError(f'{self.calculator_type} does not have a key named "drop_tokens".')
+         elif (
+             drop_tokens
+             and self.calculator.dropped_padding != "zero"
+             and self.input_size != self.output_size
+         ):
+             warnings.warn(
+                 'Setting "drop_tokens=True" without zero dropped padding when "input_size != output_size" will cause error!'
+             )
+         else:
+             self.calculator.drop_tokens = drop_tokens
+
+     def set_calculator_dropped_padding(self, dropped_padding):
+         if "dropped_padding" not in vars(self.calculator):
+             raise KeyError(
+                 f'{self.calculator_type} does not have a key named "dropped_padding".'
+             )
+         elif dropped_padding not in self.calculator.available_dropped_padding_choices:
+             raise ValueError(
+                 f"'dropped_padding' type not available! (available choices: {self.calculator.available_dropped_padding_choices})"
+             )
+         elif (
+             self.calculator.drop_tokens
+             and dropped_padding != "zero"
+             and self.input_size != self.output_size
+         ):
+             warnings.warn(
+                 f'Setting "dropped_padding={dropped_padding}" with "drop_tokens=True" when "input_size != output_size" will cause error!'
+             )
+         else:
+             self.calculator.dropped_padding = dropped_padding
+
+     def set_calculator_capacity_factor(self, capacity_factor):
+         if "capacity_factor" not in vars(self.calculator):
+             raise KeyError(
+                 f'{self.calculator_type} does not have a key named "capacity_factor".'
+             )
+         else:
+             self.calculator.capacity_factor = capacity_factor
+
+     def reset_gate_network(self):
+         self.gate.reset_gate_network()
+
+     def reset_experts(self):
+         self.calculator.reset_experts()
+
+
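+ # Concrete MoE layer used in LLaMA-MoE: a set of gated-linear-unit (GLU) experts
+ # wired to the gate and calculator created by BaseMoELayer.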
+ class LinearGLUMoELayer(BaseMoELayer):
+     def __init__(
+         self,
+         input_size,
+         hidden_size,
+         output_size,
+         hidden_act,
+         num_experts,
+         num_selects,
+         size_experts=None,
+         bias=True,
+         **kwargs,
+     ):
+         super(LinearGLUMoELayer, self).__init__()
+         assert num_selects <= num_experts
+         self.input_size = input_size
+         self.hidden_size = hidden_size
+         self.output_size = output_size
+         self.hidden_act = hidden_act
+         self.num_experts = num_experts
+         self.num_selects = num_selects
+         self.size_experts = size_experts
+         self.bias = bias
+
+         experts = LinearGLUExperts(
+             input_size,
+             hidden_size,
+             output_size,
+             hidden_act,
+             num_experts,
+             size_experts=size_experts,
+             bias=bias,
+         )
+
+         self._create_gate(**kwargs)
+         self._create_calculator(experts, **kwargs)
+
+
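+ # Decoder block: standard LLaMA self-attention followed by the MoE feed-forward
+ # layer instead of a dense MLP; it also returns the per-layer balance loss and
+ # gate statistics alongside the hidden states.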
+ class LlamaMoEDecoderLayer(nn.Module):
+     def __init__(self, config: LlamaMoEConfig, layer_index):
+         super().__init__()
+
+         self.hidden_size = config.hidden_size
+         self.self_attn = LlamaAttention(config=config)
+         self.mlp = LlamaMLP(config)
+         self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+         gating_config = {
+             # all gates
+             "gate_type": config.gate_type,
+             "gate_network": config.gate_network,
+             "gate_use_softmax": config.gate_use_softmax,
+             "gate_use_balance": config.gate_use_balance,
+             "gate_balance_loss_weight": config.gate_balance_loss_weight,
+             "gate_add_noise": config.gate_add_noise,
+             # TopKBalancedNoisyGate
+             "gate_noise_epsilon": config.gate_noise_epsilon,
+         }
+         calculator_config = {
+             # all calculators
+             "calculator_type": config.calculator_type,
+             "multiply_gate_scores": config.multiply_gate_scores,
+             "score_scale_factor": (
+                 config.score_scale_factor[layer_index]
+                 if isinstance(config.score_scale_factor, list)
+                 else config.score_scale_factor
+             ),
+             "add_weight_norm": config.add_weight_norm,
+             # SwitchDropTokenCalculator
+             "drop_tokens": config.drop_tokens,
+             "dropped_padding": config.dropped_padding,
+             "capacity_factor": config.capacity_factor,
+         }
+
+         self.mlp = LinearGLUMoELayer(
+             input_size=self.hidden_size,
+             hidden_size=config.intermediate_size,
+             output_size=self.hidden_size,
+             hidden_act=config.hidden_act,
+             num_experts=config.num_experts,
+             num_selects=config.num_selects,
+             size_experts=(
+                 config.size_experts[layer_index]
+                 if config.size_experts is not None
+                 else None
+             ),
+             bias=False,
+             **gating_config,
+             **calculator_config,
+         )
+
+     def forward(
+         self,
+         hidden_states,
+         attention_mask=None,
+         position_ids=None,
+         past_key_value=None,
+         output_attentions=False,
+         use_cache=False,
+     ) -> tuple:
+         residual = hidden_states
+         hidden_states = self.input_layernorm(hidden_states)
+
+         # Self Attention
+         hidden_states, self_attn_weights, present_key_value = self.self_attn(
+             hidden_states=hidden_states,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_value=past_key_value,
+             output_attentions=output_attentions,
+             use_cache=use_cache,
+         )
+         hidden_states = residual + hidden_states
+
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+         mlp_outs: MoEMlpOutput = self.mlp(hidden_states)
+         hidden_states = residual + mlp_outs.hidden_states
+
+         outputs = (
+             hidden_states,
+             mlp_outs.balance_loss,
+             mlp_outs.num_dropped_tokens,
+             mlp_outs.gate_load,
+             mlp_outs.gate_importance,
+         )
+         if output_attentions:
+             outputs += (self_attn_weights,)
+         if use_cache:
+             outputs += (present_key_value,)
+
+         return outputs
+
+     def set_moe_num_selects(self, num_selects):
+         self.mlp.set_num_selects(num_selects)
+
+     def set_moe_gate_use_softmax(self, use_softmax):
+         self.mlp.set_gate_use_softmax(use_softmax)
+
+     def set_moe_gate_use_balance(self, use_balance):
+         self.mlp.set_gate_use_balance(use_balance)
+
+     def set_moe_gate_balance_loss_weight(self, balance_loss_weight):
+         self.mlp.set_gate_balance_loss_weight(balance_loss_weight)
+
+     def set_moe_gate_add_noise(self, add_noise):
+         self.mlp.set_gate_add_noise(add_noise)
+
+     def set_moe_gate_noise_epsilon(self, noise_epsilon):
+         self.mlp.set_gate_noise_epsilon(noise_epsilon)
+
+     def set_moe_calculator_multiply_gate_scores(self, multiply_gate_scores):
+         self.mlp.set_calculator_multiply_gate_scores(multiply_gate_scores)
+
+     def set_moe_calculator_score_scale_factor(self, score_scale_factor):
+         self.mlp.set_calculator_score_scale_factor(score_scale_factor)
+
+     def set_moe_calculator_drop_tokens(self, drop_tokens):
+         self.mlp.set_calculator_drop_tokens(drop_tokens)
+
+     def set_moe_calculator_dropped_padding(self, dropped_padding):
+         self.mlp.set_calculator_dropped_padding(dropped_padding)
+
+     def set_moe_calculator_capacity_factor(self, capacity_factor):
+         self.mlp.set_calculator_capacity_factor(capacity_factor)
+
+     def reset_gate_network(self):
+         self.mlp.reset_gate_network()
+
+     def reset_experts(self):
+         self.mlp.reset_experts()
+
+
+ class LlamaMoEPreTrainedModel(PreTrainedModel):
+     config_class = LlamaMoEConfig
+     base_model_prefix = "model"
+     supports_gradient_checkpointing = True
+     _no_split_modules = ["LlamaMoEDecoderLayer"]
+     _skip_keys_device_placement = "past_key_values"
+
+     def _init_weights(self, module):
+         std = self.config.initializer_range
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+
+     def _set_gradient_checkpointing(self, module, value=False):
+         if isinstance(module, LlamaMoEModel):
+             module.gradient_checkpointing = value
+
+
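+ # Backbone: token embedding plus a stack of LlamaMoEDecoderLayer blocks. The forward
+ # pass accumulates the balance loss and collects gate load / importance statistics
+ # from every layer.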
+ class LlamaMoEModel(LlamaMoEPreTrainedModel):
+     def __init__(self, config: LlamaMoEConfig):
+         super().__init__(config)
+         self.padding_idx = config.pad_token_id
+         self.vocab_size = config.vocab_size
+
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+         self.layers = nn.ModuleList(
+             [LlamaMoEDecoderLayer(config, i) for i in range(config.num_hidden_layers)]
+         )
+         self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.gradient_checkpointing = False
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.embed_tokens = value
+
+     # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
+     def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
+         # create causal mask
+         # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+         combined_attention_mask = None
+         if input_shape[-1] > 1:
+             combined_attention_mask = _make_causal_mask(
+                 input_shape,
+                 inputs_embeds.dtype,
+                 device=inputs_embeds.device,
+                 past_key_values_length=past_key_values_length,
+             )
+
+         if attention_mask is not None:
+             # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+             expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
+                 inputs_embeds.device
+             )
+             combined_attention_mask = (
+                 expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
+             )
+
+         return combined_attention_mask
+
+     def forward(
+         self,
+         input_ids=None,
+         attention_mask=None,
+         position_ids=None,
+         past_key_values=None,
+         inputs_embeds=None,
+         use_cache=None,
+         output_attentions=None,
+         output_hidden_states=None,
+         return_dict=None,
+     ):
+         output_attentions = (
+             output_attentions
+             if output_attentions is not None
+             else self.config.output_attentions
+         )
+         output_hidden_states = (
+             output_hidden_states
+             if output_hidden_states is not None
+             else self.config.output_hidden_states
+         )
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+         return_dict = (
+             return_dict if return_dict is not None else self.config.use_return_dict
+         )
+
+         # retrieve input_ids and inputs_embeds
+         if input_ids is not None and inputs_embeds is not None:
+             raise ValueError(
+                 "You cannot specify both decoder_input_ids and decoder_inputs_embeds at"
+                 " the same time"
+             )
+         elif input_ids is not None:
+             batch_size, seq_length = input_ids.shape
+         elif inputs_embeds is not None:
+             batch_size, seq_length, _ = inputs_embeds.shape
+         else:
+             raise ValueError(
+                 "You have to specify either decoder_input_ids or decoder_inputs_embeds"
+             )
+
+         seq_length_with_past = seq_length
+         past_key_values_length = 0
+
+         if past_key_values is not None:
+             past_key_values_length = past_key_values[0][0].shape[2]
+             seq_length_with_past = seq_length_with_past + past_key_values_length
+
+         if position_ids is None:
+             device = input_ids.device if input_ids is not None else inputs_embeds.device
+             position_ids = torch.arange(
+                 past_key_values_length,
+                 seq_length + past_key_values_length,
+                 dtype=torch.long,
+                 device=device,
+             )
+             position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
+         else:
+             position_ids = position_ids.view(-1, seq_length).long()
+
+         if inputs_embeds is None:
+             inputs_embeds = self.embed_tokens(input_ids)
+         # embed positions
+         if attention_mask is None:
+             attention_mask = torch.ones(
+                 (batch_size, seq_length_with_past),
+                 dtype=torch.bool,
+                 device=inputs_embeds.device,
+             )
+         attention_mask = self._prepare_decoder_attention_mask(
+             attention_mask,
+             (batch_size, seq_length),
+             inputs_embeds,
+             past_key_values_length,
+         )
+
+         hidden_states = inputs_embeds
+         balance_loss = 0.0
+
+         if self.gradient_checkpointing and self.training:
+             if use_cache:
+                 logger.warning_once(
+                     "`use_cache=True` is incompatible with gradient checkpointing."
+                     " Setting `use_cache=False`..."
+                 )
+                 use_cache = False
+
+         # decoder layers
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attns = () if output_attentions else None
+         next_decoder_cache = () if use_cache else None
+
+         num_dropped_tokens = ()
+         gate_load = ()
+         gate_importance = ()
+         for idx, decoder_layer in enumerate(self.layers):
+             if output_hidden_states:
+                 all_hidden_states += (hidden_states,)
+
+             past_key_value = (
+                 past_key_values[idx] if past_key_values is not None else None
+             )
+
+             if self.gradient_checkpointing and self.training:
+
+                 def create_custom_forward(module):
+                     def custom_forward(*inputs):
+                         # None for past_key_value
+                         return module(*inputs, output_attentions, None)
+
+                     return custom_forward
+
+                 layer_outputs: tuple = torch.utils.checkpoint.checkpoint(
+                     create_custom_forward(decoder_layer),
+                     hidden_states,
+                     attention_mask,
+                     position_ids,
+                     None,
+                 )
+             else:
+                 layer_outputs: tuple = decoder_layer(
+                     hidden_states,
+                     attention_mask=attention_mask,
+                     position_ids=position_ids,
+                     past_key_value=past_key_value,
+                     output_attentions=output_attentions,
+                     use_cache=use_cache,
+                 )
+
+             hidden_states = layer_outputs[0]
+             if layer_outputs[1] is not None:
+                 balance_loss += layer_outputs[1]
+
+             if use_cache:
+                 next_decoder_cache += (layer_outputs[6 if output_attentions else 5],)
+
+             if output_attentions:
+                 all_self_attns += (layer_outputs[5],)
+
+             num_dropped_tokens += (layer_outputs[2],)
+             gate_load += (layer_outputs[3],)
+             gate_importance += (layer_outputs[4],)
+
+         hidden_states = self.norm(hidden_states)
+
+         # add hidden states from the last decoder layer
+         if output_hidden_states:
+             all_hidden_states += (hidden_states,)
+
+         next_cache = next_decoder_cache if use_cache else None
+         if not return_dict:
+             return tuple(
+                 v
+                 for v in [hidden_states, next_cache, all_hidden_states, all_self_attns]
+                 if v is not None
+             )
+         return BaseMoEModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             balance_loss=balance_loss,
+             past_key_values=next_cache,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attns,
+             num_dropped_tokens=num_dropped_tokens,
+             gate_load=gate_load,
+             gate_importance=gate_importance,
+         )
+
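+     # Sync the config with the actual modules, e.g. after experts have been resized
+     # or gate/calculator settings changed, so the saved config matches the model.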
+     def update_config(self):
+         self.config.vocab_size = self.config.vocab_size
+         self.config.max_position_embeddings = self.config.max_position_embeddings
+         # ↓↓↓↓↓↓↓↓↓↓↓↓ changed here ↓↓↓↓↓↓↓↓↓↓↓↓ #
+         self.config.hidden_size = self.layers[0].mlp.input_size
+         self.config.intermediate_size = self.layers[0].mlp.hidden_size
+         self.config.num_hidden_layers = len(self.layers)
+         self.config.num_attention_heads = self.layers[0].self_attn.num_heads
+         self.config.hidden_act = self.layers[0].mlp.hidden_act
+         # ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑ #
+         self.config.initializer_range = self.config.initializer_range
+         self.config.rms_norm_eps = self.config.rms_norm_eps
+         self.config.pretraining_tp = self.config.pretraining_tp
+         self.config.use_cache = self.config.use_cache
+         self.config.rope_scaling = self.config.rope_scaling
+         self.config._rope_scaling_validation()
+
+         self.config.num_experts = self.layers[0].mlp.num_experts
+         self.config.num_selects = self.layers[0].mlp.num_selects
+         self.config.size_experts = [
+             self.layers[i].mlp.calculator.experts.size_experts
+             for i in range(self.config.num_hidden_layers)
+         ]
+
+         self.config.gate_type = vars(self.layers[0].mlp).get(
+             "gate_type", "TopKBalancedNoisyGate"
+         )
+         self.config.gate_network = vars(self.layers[0].mlp.gate).get(
+             "gate_network_type", "mlp"
+         )
+         self.config.gate_use_softmax = vars(self.layers[0].mlp.gate).get(
+             "use_softmax", True
+         )
+         self.config.gate_use_balance = vars(self.layers[0].mlp.gate).get(
+             "use_balance", True
+         )
+         self.config.gate_balance_loss_weight = vars(self.layers[0].mlp.gate).get(
+             "balance_loss_weight", 1e-2
+         )
+         self.config.gate_add_noise = vars(self.layers[0].mlp.gate).get(
+             "add_noise", True
+         )
+         self.config.gate_noise_epsilon = vars(self.layers[0].mlp.gate).get(
+             "noise_epsilon", 1e-2
+         )
+
+         self.config.calculator_type = vars(self.layers[0].mlp).get(
+             "calculator_type", "UniversalCalculator"
+         )
+         self.config.multiply_gate_scores = vars(self.layers[0].mlp.calculator).get(
+             "multiply_gate_scores", True
+         )
+         self.config.score_scale_factor = [
+             vars(self.layers[i].mlp.calculator).get("score_scale_factor", 1.0)
+             for i in range(self.config.num_hidden_layers)
+         ]
+         self.config.drop_tokens = vars(self.layers[0].mlp.calculator).get(
+             "drop_tokens", True
+         )
+         self.config.dropped_padding = vars(self.layers[0].mlp.calculator).get(
+             "dropped_padding", "zero"
+         )
+         self.config.capacity_factor = vars(self.layers[0].mlp.calculator).get(
+             "capacity_factor", 1.25
+         )
+
+     def set_moe_num_selects(self, num_selects):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.set_moe_num_selects(num_selects)
+
+     def set_moe_gate_use_softmax(self, use_softmax):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.set_moe_gate_use_softmax(use_softmax)
+
+     def set_moe_gate_use_balance(self, use_balance):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.set_moe_gate_use_balance(use_balance)
+
+     def set_moe_gate_balance_loss_weight(self, balance_loss_weight):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.set_moe_gate_balance_loss_weight(balance_loss_weight)
+
+     def set_moe_gate_add_noise(self, add_noise):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.set_moe_gate_add_noise(add_noise)
+
+     def set_moe_gate_noise_epsilon(self, noise_epsilon):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.set_moe_gate_noise_epsilon(noise_epsilon)
+
+     def set_moe_calculator_multiply_gate_scores(self, multiply_gate_scores):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.set_moe_calculator_multiply_gate_scores(multiply_gate_scores)
+
+     def set_moe_calculator_score_scale_factor(
+         self, score_scale_factor, layer_index=None
+     ):
+         if layer_index is None:
+             for idx, decoder_layer in enumerate(self.layers):
+                 decoder_layer.set_moe_calculator_score_scale_factor(score_scale_factor)
+         else:
+             self.layers[layer_index].set_moe_calculator_score_scale_factor(
+                 score_scale_factor
+             )
+
+     def set_moe_calculator_drop_tokens(self, drop_tokens):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.set_moe_calculator_drop_tokens(drop_tokens)
+
+     def set_moe_calculator_dropped_padding(self, dropped_padding):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.set_moe_calculator_dropped_padding(dropped_padding)
+
+     def set_moe_calculator_capacity_factor(self, capacity_factor):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.set_moe_calculator_capacity_factor(capacity_factor)
+
+     def reset_gate_network(self):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.reset_gate_network()
+
+     def reset_experts(self):
+         for idx, decoder_layer in enumerate(self.layers):
+             decoder_layer.reset_experts()
+
+
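+ # Causal LM wrapper: puts the lm_head on top of LlamaMoEModel and, when labels are
+ # given, adds the accumulated gate balance loss to the cross-entropy loss.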
+ class LlamaMoEForCausalLM(LlamaMoEPreTrainedModel):
+     _tied_weights_keys = ["lm_head.weight"]
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.model = LlamaMoEModel(config)
+         self.pretraining_tp = config.pretraining_tp
+         self.vocab_size = config.vocab_size
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.model.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.model.embed_tokens = value
+
+     def get_output_embeddings(self):
+         return self.lm_head
+
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head = new_embeddings
+
+     def set_decoder(self, decoder):
+         self.model = decoder
+
+     def get_decoder(self):
+         return self.model
+
+     def forward(
+         self,
+         input_ids=None,
+         attention_mask=None,
+         position_ids=None,
+         past_key_values=None,
+         inputs_embeds=None,
+         labels=None,
+         use_cache=None,
+         output_attentions=None,
+         output_hidden_states=None,
+         return_dict=None,
+         **kwargs,
+     ):
+         output_attentions = (
+             output_attentions
+             if output_attentions is not None
+             else self.config.output_attentions
+         )
+         output_hidden_states = (
+             output_hidden_states
+             if output_hidden_states is not None
+             else self.config.output_hidden_states
+         )
+         return_dict = (
+             return_dict if return_dict is not None else self.config.use_return_dict
+         )
+
+         # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+         outputs: BaseMoEModelOutputWithPast = self.model(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+         )
+
+         hidden_states = outputs.last_hidden_state
+         logits = self.lm_head(hidden_states)
+
+         loss = None
+         if labels is not None:
+             # Shift so that tokens < n predict n
+             shift_logits = logits[..., :-1, :].contiguous()
+             shift_labels = labels[..., 1:].contiguous()
+             # Flatten the tokens
+             loss_fct = nn.CrossEntropyLoss()
+             shift_logits = shift_logits.view(-1, self.config.vocab_size)
+             shift_labels = shift_labels.view(-1)
+             # Enable model parallelism
+             shift_labels = shift_labels.to(shift_logits.device)
+             loss = loss_fct(shift_logits, shift_labels)
+             if outputs.balance_loss is not None and outputs.balance_loss > 0:
+                 loss += outputs.balance_loss
+
+         if not return_dict:
+             output = (logits,) + outputs[1:]
+             return (loss,) + output if loss is not None else output
+
+         return MoECausalLMOutputWithPast(
+             loss=loss,
+             logits=logits,
+             past_key_values=outputs.past_key_values,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+             num_dropped_tokens=outputs.num_dropped_tokens,
+             balance_loss=outputs.balance_loss,
+             gate_load=outputs.gate_load,
+             gate_importance=outputs.gate_importance,
+         )
+
+     def prepare_inputs_for_generation(
+         self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
+     ):
+         if past_key_values:
+             input_ids = input_ids[:, -1:]
+
+         position_ids = kwargs.get("position_ids", None)
+         if attention_mask is not None and position_ids is None:
+             # create position_ids on the fly for batch generation
+             position_ids = attention_mask.long().cumsum(-1) - 1
+             position_ids.masked_fill_(attention_mask == 0, 1)
+             if past_key_values:
+                 position_ids = position_ids[:, -1].unsqueeze(-1)
+
+         # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+         if inputs_embeds is not None and past_key_values is None:
+             model_inputs = {"inputs_embeds": inputs_embeds}
+         else:
+             model_inputs = {"input_ids": input_ids}
+
+         model_inputs.update(
+             {
+                 "position_ids": position_ids,
+                 "past_key_values": past_key_values,
+                 "use_cache": kwargs.get("use_cache"),
+                 "attention_mask": attention_mask,
+             }
+         )
+         return model_inputs
+
+     @staticmethod
+     def _reorder_cache(past_key_values, beam_idx):
+         reordered_past = ()
+         for layer_past in past_key_values:
+             reordered_past += (
+                 tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+             )
+         return reordered_past
+
+     def update_config(self):
+         self.model.update_config()
+
+     def set_moe_num_selects(self, num_selects):
+         self.model.set_moe_num_selects(num_selects)
+
+     def set_moe_gate_use_softmax(self, use_softmax):
+         self.model.set_moe_gate_use_softmax(use_softmax)
+
+     def set_moe_gate_use_balance(self, use_balance):
+         self.model.set_moe_gate_use_balance(use_balance)
+
+     def set_moe_gate_balance_loss_weight(self, balance_loss_weight):
+         self.model.set_moe_gate_balance_loss_weight(balance_loss_weight)
+
+     def set_moe_gate_add_noise(self, add_noise):
+         self.model.set_moe_gate_add_noise(add_noise)
+
+     def set_moe_gate_noise_epsilon(self, noise_epsilon):
+         self.model.set_moe_gate_noise_epsilon(noise_epsilon)
+
+     def set_moe_calculator_multiply_gate_scores(self, multiply_gate_scores):
+         self.model.set_moe_calculator_multiply_gate_scores(multiply_gate_scores)
+
+     def set_moe_calculator_score_scale_factor(
+         self, score_scale_factor, layer_index=None
+     ):
+         self.model.set_moe_calculator_score_scale_factor(
+             score_scale_factor, layer_index=layer_index
+         )
+
+     def set_moe_calculator_drop_tokens(self, drop_tokens):
+         self.model.set_moe_calculator_drop_tokens(drop_tokens)
+
+     def set_moe_calculator_dropped_padding(self, dropped_padding):
+         self.model.set_moe_calculator_dropped_padding(dropped_padding)
+
+     def set_moe_calculator_capacity_factor(self, capacity_factor):
+         self.model.set_moe_calculator_capacity_factor(capacity_factor)
+
+     def reset_gate_network(self):
+         self.model.reset_gate_network()
+
+     def reset_experts(self):
+         self.model.reset_experts()
smash_config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "api_key": null,
+   "verify_url": "http://johnrachwan.pythonanywhere.com",
+   "smash_config": {
+     "pruners": "None",
+     "pruning_ratio": 0.0,
+     "factorizers": "None",
+     "quantizers": "['llm-int8']",
+     "weight_quantization_bits": 4,
+     "output_deviation": 0.005,
+     "compilers": "None",
+     "static_batch": true,
+     "static_shape": true,
+     "controlnet": "None",
+     "unet_dim": 4,
+     "device": "cuda",
+     "cache_dir": "/ceph/hdd/staff/charpent/.cache/modelsbijycn3y",
+     "batch_size": 1,
+     "model_name": "llama-moe/LLaMA-MoE-v1-3_5B-2_8",
+     "task": "text_text_generation",
+     "max_batch_size": 1,
+     "qtype_weight": "torch.qint8",
+     "qtype_activation": "torch.quint8",
+     "qobserver": "<class 'torch.ao.quantization.observer.MinMaxObserver'>",
+     "qscheme": "torch.per_tensor_symmetric",
+     "qconfig": "x86",
+     "group_size": 128,
+     "damp_percent": 0.1,
+     "save_load_fn": "bitsandbytes"
+   }
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "add_prefix_space": true,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": false,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": null,
+   "padding_side": "right",
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false,
+   "use_fast": true
+ }