ekurtic committed e997309 (1 parent: d9bf1da): add readme

Files changed (1): README.md (+419 lines)

---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.1
base_model: nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
---

# Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic

## Model Overview
- **Model Architecture:** Llama-3.1-Nemotron
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 10/17/2024
- **Version:** 1.0
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

This model is a quantized version of [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation.
Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic achieves 99.41% recovery on the Arena-Hard evaluation, 100% on OpenLLM v1 (using Meta's prompting when available), and ToDo on OpenLLM v2.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Llama-3.1-Nemotron-70B-Instruct](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) to FP8 data type, ready for inference with vLLM built from source.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.

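As a rough back-of-the-envelope check on that figure (a sketch that assumes ~70B parameters and counts quantized weights only, ignoring quantization scales and the tensors kept in 16-bit):

```python
# Approximate weight storage for a ~70B-parameter model.
num_params = 70e9
bf16_gb = num_params * 2 / 1e9   # 16 bits (2 bytes) per parameter -> ~140 GB
fp8_gb = num_params * 1 / 1e9    # 8 bits (1 byte) per parameter   -> ~70 GB
print(f"BF16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB ({1 - fp8_gb / bf16_gb:.0%} smaller)")
```
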
Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the FP8 representations of the quantized weights. Activations are quantized on a per-token, dynamic basis.

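The sketch below illustrates this scheme (a minimal, illustrative implementation with hypothetical helper names, not the fused FP8 kernels that vLLM actually executes):

```python
import torch

# FP8 E4M3 represents magnitudes up to 448, so each scale maps the largest
# magnitude in its group onto that limit.
FP8_E4M3_MAX = 448.0


def quantize_weight_per_channel(weight: torch.Tensor):
    """Symmetric per-channel (per output feature) FP8 weight quantization."""
    # weight: [out_features, in_features]; one static scale per output row.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale  # dequantize as w_fp8.to(torch.bfloat16) * scale


def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic per-token FP8 activation quantization (scales computed at runtime)."""
    # x: [num_tokens, hidden_size]; one scale per token, recomputed each forward pass.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale
```
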
## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

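For example, an OpenAI-compatible server can be launched with `vllm serve neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic --tensor-parallel-size 2` and then queried with any OpenAI client. The sketch below uses the `openai` Python package; the local URL, port, and placeholder API key are assumptions for this example.

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible endpoint (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
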
## Creation

This model was created by applying [LLM-Compressor](https://github.com/vllm-project/llm-compressor), as presented in the code snippet below.

```python
import torch

from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import (  # noqa
    calculate_offload_device_map,
    custom_offload_device_map,
)

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: channel
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: token
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
model_name = model_stub.split("/")[-1]

device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype="auto"
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype="auto", device_map=device_map
)

output_dir = f"./{model_name}-FP8-dynamic"

oneshot(
    model=model,
    recipe=recipe,
    output_dir=output_dir,
    save_compressed=True,
    tokenizer=AutoTokenizer.from_pretrained(model_stub),
)
```

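The saved `./{model_name}-FP8-dynamic` directory can then be loaded directly by vLLM in place of the Hugging Face model ID used in the Deployment section above.
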
## Evaluation

This model was evaluated on the well-known Arena-Hard, OpenLLM v1, and OpenLLM v2 benchmarks.
In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.

Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.

OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge, and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals), as well as a few fixes to OpenLLM v2 tasks.

### Accuracy

<table>
  <tr><td><strong>Benchmark</strong></td><td><strong>nvidia/Llama-3.1-Nemotron-70B-Instruct-HF</strong></td><td><strong>neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic (this model)</strong></td><td><strong>Recovery</strong></td></tr>
  <tr><td><strong>Arena Hard</strong></td><td>85.0</td><td>84.5</td><td>99.41%</td></tr>
  <tr><td colspan="4"><strong>OpenLLM v1</strong></td></tr>
  <tr><td>MMLU (5-shot)</td><td>83.51</td><td>83.49</td><td>99.97%</td></tr>
  <tr><td>MMLU-cot (0-shot)</td><td>85.89</td><td>86.18</td><td>100.33%</td></tr>
  <tr><td>ARC Challenge (0-shot)</td><td>93.09</td><td>93.09</td><td>100%</td></tr>
  <tr><td>GSM-8K-cot (8-shot, strict-match)</td><td>70.13</td><td>69.98</td><td>99.78%</td></tr>
  <tr><td>Hellaswag (10-shot)</td><td>87.39</td><td>87.22</td><td>99.80%</td></tr>
  <tr><td>Winogrande (5-shot)</td><td>84.93</td><td>84.93</td><td>100%</td></tr>
  <tr><td>TruthfulQA (0-shot, mc2)</td><td>55.97</td><td>57.12</td><td>102.05%</td></tr>
  <tr><td><strong>Average</strong></td><td><strong>80.13</strong></td><td><strong>80.29</strong></td><td><strong>100.2%</strong></td></tr>
  <tr><td colspan="4"><strong>OpenLLM v2</strong></td></tr>
  <tr><td>MMLU-Pro (5-shot)</td><td>ToDo</td><td>ToDo</td><td>ToDo</td></tr>
  <tr><td>IFEval (0-shot)</td><td>73.32</td><td>74.08</td><td>101.02%</td></tr>
  <tr><td>BBH (3-shot)</td><td>ToDo</td><td>ToDo</td><td>ToDo</td></tr>
  <tr><td>Math-lvl-5 (4-shot)</td><td>23.85</td><td>21.78</td><td>91.32%</td></tr>
  <tr><td>GPQA (0-shot)</td><td>34.05</td><td>35.97</td><td>105.63%</td></tr>
  <tr><td>MuSR (0-shot)</td><td>13.5</td><td>13.35</td><td>98.88%</td></tr>
  <tr><td><strong>Average</strong></td><td><strong>ToDo</strong></td><td><strong>ToDo</strong></td><td><strong>ToDo</strong></td></tr>
</table>

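In the table above, recovery is the quantized model's score divided by the unquantized baseline's score on the same benchmark; for example, the Arena Hard row gives 84.5 / 85.0 ≈ 99.41%.
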
### Reproduction

The results were obtained using the following commands:

#### MMLU
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto
```

#### MMLU-cot
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2 \
  --tasks mmlu_cot_0shot_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

#### ARC-Challenge
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2 \
  --tasks arc_challenge_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

#### GSM-8K
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2 \
  --tasks gsm8k_cot_llama_3.1_instruct \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --num_fewshot 8 \
  --batch_size auto
```

#### Hellaswag
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size auto
```

#### Winogrande
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size auto
```

#### TruthfulQA
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --batch_size auto
```

#### OpenLLM v2
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --batch_size auto
```