alvarobartt committed
Commit e4777d6
Parent: 51e41de

Update README.md

Files changed (1):
  1. README.md (+32, −35)
README.md CHANGED
@@ -127,13 +127,13 @@ Then you just need to run the TGI v2.2.0 (or higher) Docker container as follows:
 
```bash
docker run --gpus all --shm-size 1g -ti -p 8080:80 \
-    -e MODEL_ID=hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
-    -e NUM_SHARD=4 \
-    -e QUANTIZE=awq \
-    -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-    -e MAX_INPUT_LENGTH=4000 \
-    -e MAX_TOTAL_TOKENS=4096 \
-    ghcr.io/huggingface/text-generation-inference:2.2.0
+    -e MODEL_ID=hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
+    -e NUM_SHARD=4 \
+    -e QUANTIZE=awq \
+    -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
+    -e MAX_INPUT_LENGTH=4000 \
+    -e MAX_TOTAL_TOKENS=4096 \
+    ghcr.io/huggingface/text-generation-inference:2.2.0
```
 
> [!NOTE]
@@ -143,42 +143,39 @@ To send requests to the deployed TGI endpoint compatible with the OpenAI specification…
 
```bash
curl 0.0.0.0:8080/v1/chat/completions \
-    -X POST \
-    -H 'Content-Type: application/json' \
-    -d '{
-        "model": "tgi",
-        "messages": [
-            {
-                "role": "system",
-                "content": "You are a helpful assistant."
-            },
-            {
-                "role": "user",
-                "content": "What is Deep Learning?"
-            }
-        ],
-        "max_tokens": 128
-    }'
+    -X POST \
+    -H 'Content-Type: application/json' \
+    -d '{
+        "model": "tgi",
+        "messages": [
+            {
+                "role": "system",
+                "content": "You are a helpful assistant."
+            },
+            {
+                "role": "user",
+                "content": "What is Deep Learning?"
+            }
+        ],
+        "max_tokens": 128
+    }'
```
 
- Or via the `openai` Python SDK (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as:
+ Or programmatically via the `huggingface_hub` Python client as follows (TGI is fully compatible with the OpenAI specification, so the `openai` SDK can also be used):
 
```python
import os
- from openai import OpenAI
+ from huggingface_hub import InferenceClient  # Instead of `from openai import OpenAI`
 
- client = OpenAI(
-     base_url="http://0.0.0.0:8080/v1/",
-     api_key=os.getenv("HF_TOKEN"),
- )
+ client = InferenceClient(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("HF_TOKEN", "-"))  # Instead of `client = OpenAI(base_url=..., api_key=...)`
 
chat_completion = client.chat.completions.create(
-     model="tgi",
-     messages=[
-         {"role": "system", "content": "You are a helpful assistant."},
-         {"role": "user", "content": "What is Deep Learning?"},
-     ],
-     max_tokens=128,
+     model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # Instead of `model="tgi"`
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "What is Deep Learning?"},
+     ],
+     max_tokens=128,
)
```
 
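Before sending either request from the updated README, it helps to confirm that the container has finished loading the 70B checkpoint. A minimal sketch, assuming the `8080:80` port mapping from the `docker run` command above and TGI's standard `/health` route:

```python
import urllib.request

# TGI answers on /health with HTTP 200 once the weights are loaded and the
# server is ready; before that the connection is refused or an error returns.
with urllib.request.urlopen("http://0.0.0.0:8080/health") as response:
    print(response.status)  # 200 -> ready to serve requests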
 
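Whichever client produced it, the response follows the OpenAI chat-completion shape, so the generated text sits under `choices[0].message.content`. A short sketch continuing from the `chat_completion` object in the Python snippet above; reading `usage` assumes the server reports token counts, which TGI's OpenAI-compatible route does:

```python
# The generated reply is nested inside the first choice; `usage`, when
# populated, carries prompt/completion/total token counts.
print(chat_completion.choices[0].message.content)
print(chat_completion.usage)
```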
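The `huggingface_hub` client can also stream tokens through the same OpenAI-compatible interface. A minimal sketch, assuming the same local endpoint as in the README and the OpenAI-style `stream=True` flag, under which each chunk carries its increment in `choices[0].delta.content`:

```python
import os

from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("HF_TOKEN", "-"))

# With stream=True the server returns an iterator of chunks instead of a
# single completion; printing each delta renders the reply as it is generated.
for chunk in client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "What is Deep Learning?"}],
    max_tokens=128,
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```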