The `tokenizer_config.json` is missing the `chat_template` jinja?
First off, gotta say thanks so much for the super-fast release of the DeepSeek-R1-Distill quants!

I decided to kick the tires on this one after discovering that vLLM does not yet support the new unsloth-bnb-4bit quants.

So this regular vanilla bnb-4bit quant fired right up on my 3090 Ti FE, gobbling up almost all of the 24GB VRAM at 8192 ctx, like so:
```bash
vllm serve \
    "unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit" \
    --load-format bitsandbytes \
    --quantization bitsandbytes \
    --max-model-len=8192 \
    --gpu-memory-utilization=0.99 \
    --enforce-eager \
    --host 127.0.0.1 \
    --port 8080
```
However, after trying a simple inference, I got this error message:

```
ValueError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
```
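For reference, the "simple inference" was nothing fancy, just a plain chat completion against the OpenAI-compatible endpoint vLLM exposes (a sketch; the prompt is arbitrary, host/port from the serve command above):

```python
# Plain chat completion against the vLLM OpenAI-compatible server started
# above (127.0.0.1:8080 per the serve command). This is the kind of request
# that triggered the ValueError.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.status_code)
print(resp.json())
```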
I got it working by copy-pasting the `chat_template` Jinja line from the original model's tokenizer_config.json into this model's tokenizer_config.json, editing it directly in the HF cache at ~/.cache/huggingface/hub/models--unsloth--DeepSeek-R1-Distill-Qwen-32B-bnb-4bit/snapshots/55602850ff45cd8cefce24d0c4472fd5c5794616/tokenizer_config.json, which got it up and running!
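If anyone would rather not hand-edit files in the cache, here's a rough sketch of scripting the same patch; it assumes the `chat_template` from the original deepseek-ai repo is the right one to copy over (that's what I pasted in by hand):

```python
# Sketch: copy the chat_template from the original DeepSeek repo's
# tokenizer_config.json into the cached copy of the quant's config.
# Note: this edits the HF cache in place, same as the manual fix above.
import json
from huggingface_hub import hf_hub_download

src = hf_hub_download("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "tokenizer_config.json")
dst = hf_hub_download("unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit", "tokenizer_config.json")

with open(src) as f:
    template = json.load(f)["chat_template"]

with open(dst) as f:
    cfg = json.load(f)

cfg["chat_template"] = template
with open(dst, "w") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)
```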
Not sure if this tokenizer_config.json is missing anything else, but I have enough to start inferencing now!
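For the "missing anything else" question, a quick diff of the two configs' top-level keys should show it (same assumption as above that the deepseek-ai repo is the reference):

```python
# Sketch: compare top-level keys of both tokenizer configs to spot any
# other fields that didn't make it into the quant's tokenizer_config.json.
import json
from huggingface_hub import hf_hub_download

def load_cfg(repo_id):
    with open(hf_hub_download(repo_id, "tokenizer_config.json")) as f:
        return json.load(f)

orig = load_cfg("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
quant = load_cfg("unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit")

print("missing from quant:", sorted(orig.keys() - quant.keys()))
print("extra in quant:", sorted(quant.keys() - orig.keys()))
```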
Cheers, and looking forward to vLLM supporting your new unsloth-bnb-4bit flavor quants!
EDIT: I have some example output and limited benchmarks over on r/LocalLLaMA
Amazing analysis, thank you - we'll need to investigate further.