This is great.

#1 opened by AUTOMATIC

Works with 20 tokens/sec on my 3090 at 8k context.

Is this without speculative decoding? If you use exui or TabbyAPI and enable speculative decoding with a TinyLLaMA 32K model, you can get even faster inference speeds. I can push 40 t/s with the 5.0bpw quant and this draft model:
https://huggingface.co./LoneStriker/TinyLlama-1.1B-32k-Instruct-3.0bpw-h6-exl2
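
In case it helps anyone reproduce this outside of the UIs, below is a minimal sketch of the same idea using the exllamav2 Python library directly (it's what exui and TabbyAPI build on). Treat the class and argument names (`ExLlamaV2DynamicGenerator`, `draft_model`, `draft_cache`) as version-dependent and the paths as placeholders; this is just the shape of the setup, not the exact exui/TabbyAPI configuration:

```python
# Sketch: speculative decoding with exllamav2, pairing a large exl2 quant with a
# small TinyLlama draft model. Paths are placeholders; set max_seq_len to whatever
# context actually fits in your VRAM.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

def load(model_dir, max_seq_len):
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, max_seq_len=max_seq_len, lazy=True)
    model.load_autosplit(cache)
    return model, cache, config

# Main model: the 5.0bpw (or 2.4bpw) exl2 quant
model, cache, config = load("/models/main-model-5.0bpw-exl2", max_seq_len=32768)
tokenizer = ExLlamaV2Tokenizer(config)

# Draft model: TinyLlama-1.1B-32k-Instruct-3.0bpw-h6-exl2
draft_model, draft_cache, _ = load(
    "/models/TinyLlama-1.1B-32k-Instruct-3.0bpw-h6-exl2", max_seq_len=32768)

# The generator drafts a few cheap tokens with TinyLlama and has the big model
# verify them in a single pass, which is where the extra t/s comes from.
generator = ExLlamaV2DynamicGenerator(
    model=model, cache=cache, tokenizer=tokenizer,
    draft_model=draft_model, draft_cache=draft_cache,
)

print(generator.generate(prompt="Explain speculative decoding in one sentence.",
                         max_new_tokens=128))
```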

Thanks, but I have a question. How much VRAM is needed to run this model on exui with speculative decoding? Maybe I have too little VRAM and that's why speculative decoding doesn't work for me.

Sorry, I missed that you only got 8k context with the model. Unfortunately you won't have room for a speculative decoding draft model unless we drop from 2.4 down to 2.18bpw (the lowest possible quant). But 20 t/s is already very good.

FYI, at full 32K context, loading the 2.4bpw model plus a tiny draft model needs 24 GB + 12 GB of VRAM, i.e. 1.5 3090s or 4090s. But you get token speeds of 40-60 t/s on a 4090 + 3090 Ti.
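
For context on why the draft model doesn't fit next to the 2.4bpw quant on a single 24 GB card at larger contexts, here's a hedged back-of-envelope estimator. The layer/head numbers below are illustrative for a 70B-class GQA architecture, not read from this particular model, so substitute the values from your model's config.json:

```python
# Hedged back-of-envelope VRAM estimate for an exl2 quant: quantized weights plus
# a KV cache that grows linearly with context. Architecture numbers are illustrative
# (70B-class, GQA, FP16 cache); use the real ones from the model's config.json.

def estimate_vram_gb(params_b, bpw, ctx, n_layers, n_kv_heads, head_dim, kv_bytes=2):
    weights = params_b * 1e9 * bpw / 8                          # quantized weight bytes
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bytes  # K and V per position
    return (weights + kv) / 1e9

# ~21 GB of weights + ~10.7 GB of FP16 KV cache at 32k context, about 31.7 GB total,
# before the draft model and runtime overhead, hence more than one 24 GB card.
print(round(estimate_vram_gb(70, 2.4, 32768, n_layers=80, n_kv_heads=8, head_dim=128), 1))
```

Quantizing the KV cache (which exllamav2 supports) shrinks the second term roughly 4x versus FP16 and is one way to claw back room for a draft model.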
