This model was quantized with the autoawq package, using tctsung/chat_restaurant_recommendation as the calibration dataset.

Reference model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

Key results:

  1. AWQ quantization gave a 1.62x speedup in inference, with the quantized model generating 140.47 new tokens per second.
  2. The model size shrank from 4.4GB to 0.78GB, leaving a memory footprint of only 17.57% of the original model.
  3. Across 6 different LLM tasks, the quantized model maintained similar accuracy, with a maximum accuracy degradation of only ~1%.
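The arithmetic behind the first two results can be checked with a short sketch. The function names here are hypothetical, and the baseline throughput is only implied by the reported speedup, not a number stated in this card. Using the rounded sizes (4.4GB and 0.78GB) gives roughly 17.7%, slightly off from the reported 17.57%, presumably because that figure was computed from unrounded file sizes (an assumption).

```python
def size_ratio(quantized_gb: float, original_gb: float) -> float:
    """Fraction of the original model size that remains after quantization."""
    return quantized_gb / original_gb


def implied_baseline_tps(quantized_tps: float, speedup: float) -> float:
    """Throughput of the unquantized model implied by a reported speedup."""
    return quantized_tps / speedup


# Figures taken from the key results above (sizes are rounded).
ratio = size_ratio(0.78, 4.4)
baseline = implied_baseline_tps(140.47, 1.62)

print(f"remaining size: {ratio:.2%}")                  # about 17.7% with rounded sizes
print(f"implied baseline: {baseline:.2f} tokens/s")    # about 86.7 tokens/s
```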

For more details, see the GitHub repo tctsung/LLM_quantize.

Inference tutorial

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# load model & tokenizer:
model_id = "tctsung/TinyLlama-1.1B-chat-v1.0-awq"
model = LLM(model=model_id, dtype='half',
            quantization='awq', gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=1.0,
                                 max_tokens=1024,
                                 min_p=0.5,
                                 top_p=0.85)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# define your own sys & user msg:
sys_msg = "..."
user_msg = "..."
chat_msg = [
            {"role": "system", "content": sys_msg},
            {"role": "user",  "content": user_msg}
        ]
# add_generation_prompt=True appends the assistant turn marker so the model writes a reply
input_text = tokenizer.apply_chat_template(chat_msg, tokenize=False, add_generation_prompt=True)
output = model.generate(input_text, sampling_params)
output_text = output[0].outputs[0].text
print(output_text)   # show the model output
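
To get a rough tokens-per-second number for your own hardware, one option is to time a generation call and divide the number of new tokens by the elapsed time. This is a minimal sketch, not necessarily how the 140.47 tokens/s figure above was measured; `measure_throughput` is a hypothetical helper, and the vLLM usage in the trailing comment assumes the generated token ids are available as `output[0].outputs[0].token_ids`.

```python
import time
from typing import Callable, Sequence


def measure_throughput(generate: Callable[[], Sequence[int]]) -> float:
    """Time one call to `generate` and return new tokens per second.

    `generate` is any zero-argument callable that returns the generated token ids.
    """
    start = time.perf_counter()
    token_ids = generate()
    elapsed = time.perf_counter() - start
    return len(token_ids) / elapsed


# Assumed usage with the objects defined in the tutorial above:
# tps = measure_throughput(
#     lambda: model.generate(input_text, sampling_params)[0].outputs[0].token_ids
# )
# print(f"{tps:.2f} new tokens per second")
```

A single call includes warm-up overhead, so averaging several runs gives a steadier number.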