mgoin committed · Commit d3aca23 · verified · 1 Parent(s): 1bb3f91

Update README.md

  1. README.md +86 -7
README.md CHANGED
@@ -2,21 +2,92 @@
2
  tags:
3
  - fp8
4
  - vllm
5
  ---
6
 
7
- Run with `vllm==0.6.2` on 1xH100:
8
- ```
9
- vllm serve neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16
10
  ```
11
 
12
- ## Evaluation
13
 
14
  ```
15
- TBD
16
  ```
17
 
18
  ## Creation
19
- https://github.com/vllm-project/llm-compressor/pull/185
 
20
 
21
  ```python
22
  from transformers import AutoProcessor, MllamaForConditionalGeneration
@@ -52,4 +123,12 @@ input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to
52
  output = model.generate(input_ids, max_new_tokens=20)
53
  print(processor.decode(output[0]))
54
  print("==========================================")
55
- ```
2
  tags:
3
  - fp8
4
  - vllm
5
+ language:
6
+ - en
7
+ - de
8
+ - fr
9
+ - it
10
+ - pt
11
+ - hi
12
+ - es
13
+ - th
14
+ pipeline_tag: text-generation
15
+ license: llama3.2
16
+ base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
17
  ---
18
 
19
+ # Llama-3.2-11B-Vision-Instruct-FP8-dynamic
20
+
21
+ ## Model Overview
22
+ - **Model Architecture:** Meta-Llama-3.2
23
+ - **Input:** Text/Image
24
+ - **Output:** Text
25
+ - **Model Optimizations:**
26
+   - **Weight quantization:** FP8
27
+   - **Activation quantization:** FP8
28
+ - **Intended Use Cases:** Intended for commercial and research use in multiple languages. Like [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct), this model is intended for assistant-like chat.
29
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
30
+ - **Release Date:** 9/25/2024
31
+ - **Version:** 1.0
32
+ - **License(s):** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct/blob/main/LICENSE.txt)
33
+ - **Model Developers:** Neural Magic
34
+
35
+ Quantized version of [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).
36
+
37
+ ### Model Optimizations
38
+
39
+ This model was obtained by quantizing the weights and activations of [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) to FP8 data type, ready for inference with vLLM built from source.
40
+ This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
41
+
42
+ Only the weights and activations of the linear operators within transformer blocks are quantized. Weights use symmetric per-channel quantization, in which one linear scale per output dimension maps between the original values and their FP8 representations. Activations are quantized dynamically, with a scale computed per token at runtime.
43
+ [LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
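As a rough illustration of the scheme described above, the NumPy sketch below emulates symmetric per-channel weight scaling and per-token dynamic activation scaling. It is not LLM Compressor's implementation: the helper names (`round_to_e4m3`, `quantize_rows`) are hypothetical, FP8 is assumed to be the E4M3 format with a maximum magnitude of 448, and rounding is simulated in float64 rather than with a real FP8 dtype.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumption: E4M3 format, largest finite magnitude


def round_to_e4m3(x):
    """Emulate rounding to the E4M3 value grid (NaN/inf handling omitted)."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    _, exp = np.frexp(x)           # x = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, -5)      # below 2**-6 the grid is subnormal (fixed step)
    step = np.ldexp(1.0, exp - 4)  # 1 implicit + 3 stored mantissa bits
    return np.round(x / step) * step


def quantize_rows(t):
    """Symmetric quantization with one scale per row of a 2-D array."""
    scale = np.abs(t).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    return round_to_e4m3(t / scale), scale


rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 128))     # (out_features, in_features)
activations = rng.standard_normal((8, 128))  # (tokens, in_features)

q_w, s_w = quantize_rows(weights)      # static: one scale per output channel
q_a, s_a = quantize_rows(activations)  # dynamic: one scale per token

# Matrix multiply on the quantized values, then undo both scales.
approx = (q_a @ q_w.T) * s_a * s_w.T
exact = activations @ weights.T
rel_error = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error: {rel_error:.4f}")
```

Because both scales factor out of the matrix product, the dequantization is a single elementwise multiply on the output, which is what makes per-channel and per-token scaling cheap to apply at inference time.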
44
+
45
+ ## Deployment
46
+
47
+ ### Use with vLLM
48
+
49
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
50
+
51
+ ```python
52
+ from vllm import LLM, SamplingParams
53
+ from vllm.assets.image import ImageAsset
54
+
55
+ # Initialize the LLM
56
+ model_name = "neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic"
57
+ llm = LLM(model=model_name, max_num_seqs=1, enforce_eager=True)
58
+
59
+ # Load the image
60
+ image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
61
+
62
+ # Create the prompt
63
+ question = "If I had to write a haiku for this one, it would be: "
64
+ prompt = f"<|image|><|begin_of_text|>{question}"
65
+
66
+ # Set up sampling parameters
67
+ sampling_params = SamplingParams(temperature=0.2, max_tokens=30)
68
+
69
+ # Generate the response
70
+ inputs = {
71
+ "prompt": prompt,
72
+ "multi_modal_data": {
73
+ "image": image
74
+ },
75
+ }
76
+ outputs = llm.generate(inputs, sampling_params=sampling_params)
77
+
78
+ # Print the generated text
79
+ print(outputs[0].outputs[0].text)
80
  ```
81
 
82
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
83
 
84
  ```
85
+ vllm serve neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16
86
  ```
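Once the server started by the command above is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library. The helper names, the `http://localhost:8000/v1` base URL (vLLM's default port, assumed unchanged), and the placeholder image URL are illustrative assumptions rather than part of the original model card.

```python
import json
import urllib.request

MODEL = "neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic"


def build_chat_request(question, image_url, max_tokens=64):
    """Build an OpenAI-style chat completion payload with one image."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }


def send_chat_request(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Placeholder image URL; replace it with a reachable image.
payload = build_chat_request("Describe this image.", "https://example.com/image.jpg")

# With the server running, uncomment to send the request:
# print(send_chat_request(payload))
```

The official `openai` Python client can be pointed at the same base URL if a higher-level interface is preferred.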
87
 
88
  ## Creation
89
+
90
+ This model was created by applying [LLM Compressor](https://github.com/vllm-project/llm-compressor/blob/f90013702b15bd1690e4e2fe9ed434921b6a6199/examples/quantization_w8a8_fp8/llama3.2_vision_example.py), as presented in the code snippet below.
91
 
92
  ```python
93
  from transformers import AutoProcessor, MllamaForConditionalGeneration
 
123
  output = model.generate(input_ids, max_new_tokens=20)
124
  print(processor.decode(output[0]))
125
  print("==========================================")
126
+ ```
127
+
128
+ ## Evaluation
129
+
130
+ TBD
131
+
132
+ ### Reproduction
133
+
134
+ TBD