This is great.

#1 opened by AUTOMATIC

Works with 20 tokens/sec on my 3090 at 8k context.

Is this without speculative decoding? If you use exui or TabbyAPI and enable speculative decoding with a TinyLLaMA 32K model, you can get even faster inference speeds. I can push 40 t/s with the 5.0bpw quant and this draft model:
https://huggingface.co./LoneStriker/TinyLlama-1.1B-32k-Instruct-3.0bpw-h6-exl2
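
In case it helps anyone reproduce this outside of the UIs, below is a minimal sketch of the same idea using the exllamav2 Python library directly (it's what exui and TabbyAPI build on). Treat the class and argument names (`ExLlamaV2DynamicGenerator`, `draft_model`, `draft_cache`) as version-dependent and the paths as placeholders; this is just the shape of the setup, not the exact exui/TabbyAPI configuration:

```python
# Sketch: speculative decoding with exllamav2, pairing a large exl2 quant with a
# small TinyLlama draft model. Paths are placeholders; set max_seq_len to whatever
# context actually fits in your VRAM.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

def load(model_dir, max_seq_len):
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, max_seq_len=max_seq_len, lazy=True)
    model.load_autosplit(cache)
    return model, cache, config

# Main model: the 5.0bpw (or 2.4bpw) exl2 quant
model, cache, config = load("/models/main-model-5.0bpw-exl2", max_seq_len=32768)
tokenizer = ExLlamaV2Tokenizer(config)

# Draft model: TinyLlama-1.1B-32k-Instruct-3.0bpw-h6-exl2
draft_model, draft_cache, _ = load(
    "/models/TinyLlama-1.1B-32k-Instruct-3.0bpw-h6-exl2", max_seq_len=32768)

# The generator drafts a few cheap tokens with TinyLlama and has the big model
# verify them in a single pass, which is where the extra t/s comes from.
generator = ExLlamaV2DynamicGenerator(
    model=model, cache=cache, tokenizer=tokenizer,
    draft_model=draft_model, draft_cache=draft_cache,
)

print(generator.generate(prompt="Explain speculative decoding in one sentence.",
                         max_new_tokens=128))
```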

Thanks, but I have a question. How much VRAM is needed to run this model on exui with speculative decoding? Maybe I have too little VRAM and that's why speculative decoding doesn't work for me.

Sorry, I missed that you only got 8k context with the model. Unfortunately you won't have room for a speculative decoding draft model unless we drop from 2.4 down to 2.18bpw (the lowest possible quant). But 20 t/s is already very good.

FYI, at full 32K context, loading the 2.4bpw model plus a tiny draft model needs 24 GB + 12 GB of VRAM, i.e. 1.5 3090s or 4090s. But you get token speeds of 40-60 t/s on a 4090 + 3090 Ti.
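
For context on why the draft model doesn't fit next to the 2.4bpw quant on a single 24 GB card at larger contexts, here's a hedged back-of-envelope estimator. The layer/head numbers below are illustrative for a 70B-class GQA architecture, not read from this particular model, so substitute the values from your model's config.json:

```python
# Hedged back-of-envelope VRAM estimate for an exl2 quant: quantized weights plus
# a KV cache that grows linearly with context. Architecture numbers are illustrative
# (70B-class, GQA, FP16 cache); use the real ones from the model's config.json.

def estimate_vram_gb(params_b, bpw, ctx, n_layers, n_kv_heads, head_dim, kv_bytes=2):
    weights = params_b * 1e9 * bpw / 8                          # quantized weight bytes
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bytes  # K and V per position
    return (weights + kv) / 1e9

# ~21 GB of weights + ~10.7 GB of FP16 KV cache at 32k context, about 31.7 GB total,
# before the draft model and runtime overhead, hence more than one 24 GB card.
print(round(estimate_vram_gb(70, 2.4, 32768, n_layers=80, n_kv_heads=8, head_dim=128), 1))
```

Quantizing the KV cache (which exllamav2 supports) shrinks the second term roughly 4x versus FP16 and is one way to claw back room for a draft model.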
