llama-2-7b-chat-marlin

An example of converting a GPTQ model to Marlin format for fast batched decoding with the Marlin kernels.

Install Marlin

pip install torch
git clone https://github.com/IST-DASLab/marlin.git
cd marlin
pip install -e .

Convert Model

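Convert the model from GPTQ to Marlin format. Note that the source GPTQ checkpoint must have been quantized with:

  • sym=true
  • group_size=128
  • desc_activations=false (the desc_act setting in the GPTQ quantization config)

Install the required packages:

pip install -U transformers accelerate auto-gptq optimum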
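
As a minimal sketch, the three requirements listed above can be checked against a checkpoint's quantization settings before converting. The key names sym, group_size, and desc_act are an assumption about how the GPTQ checkpoint's quantize_config.json names these fields, and marlin_config_problems is a hypothetical helper, not part of this repo or of Marlin.

```python
# Sketch only: the key names below are assumed to match the fields in the
# GPTQ checkpoint's quantize_config.json; adjust them if your file differs.
REQUIRED = {"sym": True, "group_size": 128, "desc_act": False}


def marlin_config_problems(quantize_config):
    """Return a list of human-readable mismatches (empty if there are none)."""
    problems = []
    for key, expected in REQUIRED.items():
        actual = quantize_config.get(key, "<missing>")
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, found {actual!r}")
    return problems


# Illustrative configs (made-up values, not read from a real checkpoint).
good = {"bits": 4, "sym": True, "group_size": 128, "desc_act": False}
bad = {"bits": 4, "sym": True, "group_size": 32, "desc_act": True}

print(marlin_config_problems(good))
print(marlin_config_problems(bad))
```

In practice the dictionary would come from json.load on the checkpoint's config file; an empty result suggests the checkpoint satisfies the listed requirements.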

Convert with the convert.py script in this repo:

python3 convert.py --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" --save-path "./marlin-model" --do-generation

Run Model

Load the converted model with the load.load_model utility from this repo, then run inference as usual.

from load import load_model
from transformers import AutoTokenizer

# Load model from disk.
model_path = "./marlin-model"
model = load_model(model_path).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Generate text.
inputs = tokenizer("My favorite song is", return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.batch_decode(outputs)[0])
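
Since Marlin targets batched decoding, the single-prompt example above extends naturally to several prompts at once. The sketch below covers only the batching step; batched is a hypothetical helper and the prompts are placeholders, so treat it as one possible arrangement rather than this repo's method.

```python
def batched(prompts, batch_size):
    """Yield consecutive slices of `prompts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield prompts[start:start + batch_size]


# Placeholder prompts for illustration.
prompts = ["My favorite song is", "The capital of France is", "Marlin is"]
for batch in batched(prompts, batch_size=2):
    print(len(batch), batch)
```

Each batch could then be tokenized with padding=True and passed to model.generate as in the example above. Llama tokenizers usually have no pad token by default, so one would likely need to be set first (for example, tokenizer.pad_token = tokenizer.eos_token), with left padding for generation.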