Align tokenizer with mistral-common
This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

mistral_tok = MistralTokenizer.v1()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", revision="pr/120")

hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))
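Both prints should output True once the template matches. If the token comparison fails, a minimal sketch for locating the mismatch (the helper below is hypothetical, not part of either library):

```python
# Hypothetical helper: find the first index where the two token
# sequences diverge, to narrow down where the templates disagree.
def first_divergence(a, b):
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Common prefix agrees; they differ only if the lengths differ.
    return None if len(a) == len(b) else min(len(a), len(b))

idx = first_divergence(hf_tokens, mistral_tokens)
if idx is not None:
    print(f"First divergence at token {idx}")
    print("HF:     ", hf_tokens[max(0, idx - 3):idx + 3])
    print("mistral:", mistral_tokens[max(0, idx - 3):idx + 3])
```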
Hey @Rocketknight1, thank you for the quick fix!
It passes our assertion tests, except for the system prompt handling, which was already a problem before this issue.
The mistral-common library prepends the system prompt, followed by \n\n, to the start of the first user message if one is provided.
Is there a reason you aren't adding it to the chat template in the same way?
Hi @ChristianPalaArtificialy, no reason! We didn't know that was the preferred behaviour, because mistral-common didn't exist when those templates were written. If you'd like, I can amend the templates to do that.
To be clear, this means that you want something like this, correct? (obviously without destructively editing the input chat)
if messages[0]['role'] == 'system':
    messages[1]['content'] = messages[0]['content'] + '\n\n' + messages[1]['content']
    messages = messages[1:]

for message in messages:
    # Render the messages as normal here
    ...
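For instance, a non-destructive sketch of the same idea (merge_system_prompt is just an illustrative helper, not the actual template logic):

```python
# Hypothetical helper: fold the system prompt into a *copy* of the first
# user message, leaving the caller's chat list untouched.
def merge_system_prompt(messages):
    if messages and messages[0]["role"] == "system":
        merged = dict(messages[1])  # shallow copy of the first user turn
        merged["content"] = messages[0]["content"] + "\n\n" + merged["content"]
        return [merged] + list(messages[2:])
    return list(messages)
```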
Also, is this the behaviour you want for models using both the v3 and v1 tokenizers?
Hi @Rocketknight1,
True, before the mistral-common library was published we also had to speculate about how the system message was added during fine-tuning.
There is no change in that part of the code between v3 and v1, so yes that would be our preferred behavior!
@ChristianPalaArtificialy updated the template for the v1 models! I'll work on the v3 models in a sec. Please let me know if it's passing all your tests!
v3 models also updated
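For completeness, the same check can be re-run against a v3 pairing; a minimal sketch (the model ID is an assumption, and you may need to pin revision to the branch carrying the fix):

```python
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
]

mistral_tok = MistralTokenizer.v3()
# Assumed model ID; add revision=... to test an unmerged branch.
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)
mistral_tokens = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
).tokens
print(hf_tokens == mistral_tokens)
```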
Hey @Rocketknight1, \n\n is encoded as the hexadecimal byte tokens <0x0A><0x0A> by the mistral-common library, so I'm now getting failures on the text comparisons between the two methods (but that's just an artifact of how the mistral-common library returns the text).
The assertions on the token comparisons are all good on our end.
Nice!
All PRs should be open now. I think everything's ready to merge, but let me know if there's anything else you need to do first.
Ok!