The `tokenizer_config.json` is missing the `chat_template` Jinja?

#1
by ubergarm - opened

First gotta say, thanks so much for the super fast release of the DeepSeek-R1-Distill quants!

I decided to kick the tires on this one after discovering vllm does not yet support the new unsloth-bnb-4bit quants.

So this regular vanilla bnb-4bit quant fired right up on my 3090 Ti FE, gobbling up almost all of the 24GB of VRAM at 8192 ctx, like so:

```bash
vllm serve \
    "unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit" \
    --load-format bitsandbytes \
    --quantization bitsandbytes \
    --max-model-len=8192 \
    --gpu-memory-utilization=0.99 \
    --enforce-eager \
    --host 127.0.0.1 \
    --port 8080
```
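
For reference, a "simple inference" here just means a plain request against the OpenAI-compatible `/v1/chat/completions` route that `vllm serve` exposes. The payload below is an illustrative sketch, not my exact request:

```python
# Illustrative test request against the OpenAI-compatible endpoint
# that `vllm serve` exposes on the host/port configured above.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())
```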

However, after trying a simple inference, I got this error message:

```
ValueError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
```
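
You can confirm the root cause without vllm at all. A minimal check, assuming only `transformers` is installed:

```python
# Minimal sketch: check whether the quant's tokenizer defines a chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit")
print(tok.chat_template)  # None is what makes apply_chat_template() raise the ValueError above
```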

I got it working by copy-pasting the `chat_template` Jinja line from the original model's `tokenizer_config.json` into this quant's copy, i.e. by editing `~/.cache/huggingface/hub/models--unsloth--DeepSeek-R1-Distill-Qwen-32B-bnb-4bit/snapshots/55602850ff45cd8cefce24d0c4472fd5c5794616/tokenizer_config.json`, which got it up and running!
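
If you'd rather not hand-edit files in the HF cache, here's a sketch of the same fix done programmatically, assuming the upstream repo `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` is where the original `tokenizer_config.json` comes from:

```python
# Sketch: copy the chat_template from the upstream tokenizer into the
# quant's tokenizer, then save it to a local directory.
from transformers import AutoTokenizer

src = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
dst = AutoTokenizer.from_pretrained("unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit")

dst.chat_template = src.chat_template  # the Jinja template string
dst.save_pretrained("./fixed-tokenizer")  # writes tokenizer_config.json with chat_template included
```

vllm's OpenAI-compatible server also accepts a `--chat-template` flag pointing at a standalone Jinja file, which would avoid touching the cached snapshot entirely.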

Not sure if this `tokenizer_config.json` is missing anything else, but I have enough to start inferencing now!
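
If anyone wants to check for other gaps, here's a quick sketch that diffs the two configs' top-level keys (assumes `huggingface_hub` is installed and that `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` is the upstream repo):

```python
# Sketch: compare top-level keys of the upstream vs. quant tokenizer configs.
import json
from huggingface_hub import hf_hub_download

orig = json.load(open(hf_hub_download(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "tokenizer_config.json")))
quant = json.load(open(hf_hub_download(
    "unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit", "tokenizer_config.json")))

print(set(orig) - set(quant))  # keys present upstream but missing in the quant
```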

Cheers and looking forward to vllm supporting your new unsloth-bnb-4bit flavor quants!

EDIT: I have some example output and limited benchmarks over on r/LocalLLaMA

Unsloth AI org

Amazing analysis, thank you - we'll need to investigate further.
