The original meta-llama/Llama-3.3-70B-Instruct model quantized to 4-bit with AutoAWQ. The steps below reproduce the quantization.

Quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'meta-llama/Llama-3.3-70B-Instruct'
quant_path = 'Llama-3.3-70B-Instruct-AWQ-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
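To sanity-check the result before serving it, the quantized checkpoint can be loaded back with Transformers (which reads AWQ checkpoints when autoawq is installed) and prompted directly. This is a minimal sketch; the prompt and generation settings are illustrative, not part of the quantization recipe.

# Quick sanity check: load the AWQ checkpoint and generate a short reply.
# Assumes autoawq is installed and enough GPU memory is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = 'Llama-3.3-70B-Instruct-AWQ-4bit'
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_path)

messages = [{"role": "user", "content": "Summarize AWQ quantization in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))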
vLLM serve
vllm serve lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
--swap-space 16 \
--disable-log-requests \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2
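The server exposes vLLM's OpenAI-compatible API, by default on http://localhost:8000/v1. A minimal sketch of a chat request with the openai Python client (the base URL and dummy API key reflect vLLM defaults and are assumptions if you changed the serve flags):

# Send a chat completion request to the running vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not check the key by default
response = client.chat.completions.create(
    model="lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit",
    messages=[{"role": "user", "content": "Explain AWQ quantization in one paragraph."}],
    max_tokens=256,
)
print(response.choices[0].message.content)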
Benchmark
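benchmark_serving.py ships in the benchmarks/ directory of the vLLM repository, and the ShareGPT file used below has to be downloaded first. One way to fetch it (the dataset repo id is an assumption; any copy of ShareGPT_V3_unfiltered_cleaned_split.json works):

# Fetch the ShareGPT dataset used by benchmark_serving.py into the current directory.
# The repo id is an assumption; substitute any mirror of this file.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="anon8231489123/ShareGPT_Vicuna_unfiltered",
    filename="ShareGPT_V3_unfiltered_cleaned_split.json",
    repo_type="dataset",
    local_dir=".",
)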
python benchmark_serving.py \
--backend vllm \
--model lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
--tokenizer meta-llama/Meta-Llama-3-70B \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1000
============ Serving Benchmark Result ============
Successful requests: 902
Benchmark duration (s): 128.07
Total input tokens: 177877
Total generated tokens: 182359
Request throughput (req/s): 7.04
Output token throughput (tok/s): 1423.85
Total Token throughput (tok/s): 2812.71
---------------Time to First Token----------------
Mean TTFT (ms): 47225.59
Median TTFT (ms): 43313.95
P99 TTFT (ms): 105587.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 141.01
Median TPOT (ms): 148.94
P99 TPOT (ms): 174.16
---------------Inter-token Latency----------------
Mean ITL (ms): 131.55
Median ITL (ms): 150.82
P99 ITL (ms): 344.50
==================================================
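The throughput figures follow directly from the raw totals above: roughly 902 requests and 182,359 generated tokens completed in 128.07 s.

# Cross-check the reported throughput numbers from the totals above.
duration_s = 128.07
input_tokens = 177877
generated_tokens = 182359
successful_requests = 902

print(successful_requests / duration_s)                 # ~7.04 req/s
print(generated_tokens / duration_s)                    # ~1424 output tok/s
print((input_tokens + generated_tokens) / duration_s)   # ~2813 total tok/s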