This model has been quantized using GPTQModel.
- bits: 4
- group_size: 128
- desc_act: true
- static_groups: false
- sym: true
- lm_head: false
- damp_percent: 0.01
- true_sequential: true
- model_name_or_path: ""
- model_file_base_name: "model"
- quant_method: "gptq"
- checkpoint_format: "gptq"
- meta:
  - quantizer: "gptqmodel:0.9.9-dev0"
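
To give a feel for what `bits: 4` and `group_size: 128` mean for checkpoint size, here is a rough sketch of the storage cost per quantized weight. It assumes the usual GPTQ layout, with one 16-bit scale and one zero point (packed at the weight bit width) per group. It ignores the `g_idx` tensor used with `desc_act: true` and any layers left unquantized (such as `lm_head`). The helper name `approx_bits_per_weight` is made up for this illustration and is not part of GPTQModel.

```python
# Values copied from the configuration list above.
quant_config = {"bits": 4, "group_size": 128}


def approx_bits_per_weight(bits, group_size, scale_bits=16):
    """Approximate storage per weight: the packed weight itself plus its
    share of the per-group scale and zero point (assumed layout)."""
    zero_bits = bits  # assumption: zero points are packed at the weight width
    return bits + (scale_bits + zero_bits) / group_size


bpw = approx_bits_per_weight(quant_config["bits"], quant_config["group_size"])
print(f"about {bpw:.5f} bits per quantized weight")
```

Under these assumptions the per-group metadata adds only a small fraction of a bit per weight, which is why a group size of 128 is a common trade-off between accuracy and size.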
Here is an example that loads the quantized model and generates a completion:

import torch
from transformers import AutoTokenizer
from gptqmodel import GPTQModel

# Run on the first CUDA GPU.
device = torch.device("cuda:0")
model_name = "ModelCloud/Meta-Llama-3.1-8B-gptq-4bit"
prompt = "I am in Shanghai, preparing to visit the natural history museum. Can you tell me the best way to"

# Load the tokenizer and the quantized model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTQModel.from_quantized(model_name)

# Tokenize the prompt, generate up to 512 new tokens without beam search, and print the decoded text.
inputs = tokenizer(prompt, return_tensors="pt").to(device)
res = model.generate(**inputs, num_beams=1, min_new_tokens=1, max_new_tokens=512)
print(tokenizer.decode(res[0]))