Tokenizer config is wrong
LlamaTokenizerFast -> Qwen2Tokenizer
Qwen always uses Qwen2Tokenizer.
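A quick way to see the mismatch from Python (a sketch using `transformers`; only the tokenizer files are downloaded, and the printed class names are what the configs currently declare):

```python
from transformers import AutoTokenizer

# The distill's tokenizer_config.json declares LlamaTokenizerFast...
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
print(type(tok).__name__)   # LlamaTokenizerFast

# ...while the Qwen2.5 base models declare Qwen2Tokenizer
base = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
print(type(base).__name__)  # Qwen2TokenizerFast (the fast variant)
```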
Sorry, I've updated the tokenizer class in the first comment. The current tokenizer config lists the tokenizer class as LlamaTokenizerFast.
@bartowski sorry if this is something you were already aware of, but could this be causing some of the issues people see with local usage? I checked, and all the Qwen-based distills seem to have the same Llama tokenizer class instead of the Qwen one used by their respective base models.
It seems unlikely, since llama.cpp uses its own tokenizer implementation, but it's possible the existing conversion code was based on an incorrect tokenizer config.
That said, I don't think it should affect the final result.
I've seen people have better results with lower temperature and proper prompting
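For example, with llama-cpp-python (just a sketch; the model path is a placeholder and 0.6 follows the commonly shared guidance for the R1 distills):

```python
from llama_cpp import Llama

# Placeholder path to a local GGUF of the distill
llm = Llama(model_path="DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf")

# Lower temperature than the usual 0.8 default, plain single-turn prompt
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Solve: 17 * 23. Show your reasoning."}],
    temperature=0.6,
)
print(out["choices"][0]["message"]["content"])
```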
@ngxson any thoughts?
For GGUF, the tokenizer is determined by the Model class in the conversion script, not by the Tokenizer class, so the value in tokenizer_config.json doesn't matter.
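Roughly, the dispatch in convert_hf_to_gguf.py follows this pattern (a simplified sketch, not the actual code): the Model subclass is selected from the `architectures` field in config.json, and that subclass decides how the vocab is exported.

```python
# Simplified sketch: the Model subclass is picked from config.json's
# "architectures" entry, and its set_vocab() decides how the tokenizer
# is exported. tokenizer_config.json's "tokenizer_class" is never read.

class Model:
    _registry: dict[str, type["Model"]] = {}

    @classmethod
    def register(cls, *architectures: str):
        def wrapper(subclass: type["Model"]) -> type["Model"]:
            for arch in architectures:
                cls._registry[arch] = subclass
            return subclass
        return wrapper

    @classmethod
    def from_architecture(cls, arch: str) -> type["Model"]:
        return cls._registry[arch]


@Model.register("Qwen2ForCausalLM")
class Qwen2Model(Model):
    def set_vocab(self) -> None:
        # Qwen-family models get GPT-2-style BPE vocab handling here,
        # regardless of what tokenizer_config.json claims.
        ...


print(Model.from_architecture("Qwen2ForCausalLM").__name__)  # Qwen2Model
```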
That's what I thought, thanks for confirming!
https://huggingface.co./deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/discussions/4
I think all the Qwen ones may be completely busted, with the wrong tokenizer config and special tokens (both in llama.cpp and transformers) :/
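One way to check this across repos without pulling full models (a sketch; the distill-to-base pairing below is my assumption from the R1 report, adjust as needed):

```python
import json
from huggingface_hub import hf_hub_download

def tok_cfg(repo_id: str) -> dict:
    # Fetch only the tokenizer config, not the weights
    path = hf_hub_download(repo_id=repo_id, filename="tokenizer_config.json")
    with open(path) as f:
        return json.load(f)

# Assumed pairing of distills to their Qwen2.5 base models
PAIRS = [
    ("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "Qwen/Qwen2.5-Math-7B"),
    ("deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", "Qwen/Qwen2.5-14B"),
]

for distill, base in PAIRS:
    d, b = tok_cfg(distill), tok_cfg(base)
    for key in ("tokenizer_class", "bos_token", "eos_token"):
        if d.get(key) != b.get(key):
            print(f"{distill}: {key} = {d.get(key)!r} (base: {b.get(key)!r})")
```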