alvarobartt committed
Commit b5422fe (parent: 32958f2)

Update README.md

Files changed (1): README.md +55 -1

README.md CHANGED
@@ -127,6 +127,7 @@ Then you just need to run the TGI v2.2.0 (or higher) Docker container as follows
 
 ```bash
 docker run --gpus all --shm-size 1g -ti -p 8080:80 \
+    -v hf_cache:/data \
     -e MODEL_ID=hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
     -e NUM_SHARD=4 \
     -e QUANTIZE=awq \
@@ -139,7 +140,7 @@ docker run --gpus all --shm-size 1g -ti -p 8080:80 \
 > [!NOTE]
 > TGI will expose different endpoints, to see all the endpoints available check [TGI OpenAPI Specification](https://huggingface.github.io/text-generation-inference/#/).
 
-To send request to the deployed TGI endpoint compatible with [OpenAI specification](https://github.com/openai/openai-openapi) i.e. `/v1/chat/completions`:
+To send requests to the deployed TGI endpoint compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:
 
 ```bash
 curl 0.0.0.0:8080/v1/chat/completions \
@@ -182,6 +183,59 @@ chat_completion = client.chat.completions.create(
 )
 ```
 
+### vLLM
+
+To run vLLM with Llama 3.1 70B Instruct AWQ in INT4, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and run the latest vLLM Docker container as follows:
+
+```bash
+docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
+    -v hf_cache:/root/.cache/huggingface \
+    vllm/vllm-openai:latest \
+    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
+    --tensor-parallel-size 4 \
+    --max-model-len 4096
+```
+
+To send requests to the deployed vLLM endpoint compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:
+
+```bash
+curl 0.0.0.0:8000/v1/chat/completions \
+    -X POST \
+    -H 'Content-Type: application/json' \
+    -d '{
+        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
+        "messages": [
+            {
+                "role": "system",
+                "content": "You are a helpful assistant."
+            },
+            {
+                "role": "user",
+                "content": "What is Deep Learning?"
+            }
+        ],
+        "max_tokens": 128
+    }'
+```
+
+Or programmatically via the `openai` Python client (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:
+
+```python
+import os
+from openai import OpenAI
+
+client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))
+
+chat_completion = client.chat.completions.create(
+    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "What is Deep Learning?"},
+    ],
+    max_tokens=128,
+)
+```
+
 ## Quantization Reproduction
 
 > [!NOTE]
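
For reference, the OpenAI-compatible `/v1/chat/completions` route exposed by both containers can also be queried with `InferenceClient` from `huggingface_hub`. The snippet below is a minimal sketch rather than part of the committed README, and it assumes a recent `huggingface_hub` release (v0.23+) is installed and that the TGI container from the first hunk is listening on port 8080 (point `base_url` at port 8000 to target the vLLM deployment instead):

```python
from huggingface_hub import InferenceClient

# Assumption: the TGI container from this commit is running locally on port 8080;
# no API key is needed for a local deployment.
client = InferenceClient(base_url="http://0.0.0.0:8080")

# Sends a POST request to <base_url>/v1/chat/completions under the hood.
chat_completion = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
print(chat_completion.choices[0].message.content)
```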