Allow prefilling assistant message

#49
by tomasmcm - opened

The current chat template wraps every message in `<|im_start|>` + message.role + `\n` … `<|im_end|>` + `\n`, and always appends `<|im_start|>assistant\n<think>\n` after the final message.
This means it's not possible to send a prefilled assistant message for the model to continue generating from.
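For reference, the current template's behaviour amounts to something like this Python sketch (the real template is Jinja; the helper name here is just illustrative):

```python
def render_current(messages):
    """Mimic the current Jinja template: every message, including a
    trailing assistant message, is wrapped and closed with <|im_end|>,
    and a fresh assistant turn is always opened at the end."""
    prompt = ""
    for m in messages:
        prompt += "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>" + "\n"
    # Because a new assistant turn is always opened here, any assistant
    # "prefill" above has already been sealed off and cannot be continued.
    prompt += "<|im_start|>assistant\n<think>\n"
    return prompt
```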

However, if we change the template to this: https://gist.github.com/tomasmcm/6fd3397eb44e3fbef4cf876451098a92 (note the `loop.last` checks and the `role != "assistant"` condition at the end), the model can continue from the last assistant message it receives, as in the sketch below.
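In Python terms, the change leaves the final assistant message unterminated instead of closing it and opening a new turn (a sketch of the gist's logic, not the Jinja itself):

```python
def render_prefill(messages):
    """Mimic the modified template: if the last message is from the
    assistant, leave it open so the model continues it directly."""
    prompt = ""
    for i, m in enumerate(messages):
        is_last = i == len(messages) - 1            # the loop.last check
        prompt += "<|im_start|>" + m["role"] + "\n" + m["content"]
        if not (is_last and m["role"] == "assistant"):
            prompt += "<|im_end|>" + "\n"           # close every other message
    if messages[-1]["role"] != "assistant":
        # No prefill present: open a fresh assistant turn as before.
        prompt += "<|im_start|>assistant\n<think>\n"
    return prompt
```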

This approach would allow building "reasoning_effort" or "thinking_budget_tokens" solutions. By counting the thinking tokens as they are generated, we can make sure the reasoning stays under a limit; if it exceeds the limit, we halt generation, append `\n</think>\n\n`, and send the partial output back to the model as a prefilled assistant message. The model then continues from the existing `<think>` content and generates the final answer, as sketched below.
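Here is a minimal sketch of that budget logic, assuming an OpenAI-compatible server (e.g. LM Studio) serving the modified template; the character-based limit, function name, and model id are illustrative, not taken from dttm-proxy:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def generate_with_budget(messages, max_thinking_chars=2000):
    # Stream the response so we can count the reasoning as it is produced.
    stream = client.chat.completions.create(
        model="qwen/qwq-32b", messages=messages, stream=True
    )
    thinking = ""
    for chunk in stream:
        if not chunk.choices:
            continue
        thinking += chunk.choices[0].delta.content or ""
        if "</think>" in thinking:
            # The model closed its reasoning within budget; let the rest of
            # the answer stream as usual (omitted here for brevity).
            return thinking
        if len(thinking) >= max_thinking_chars:
            break  # halt generation at the budget

    # Force the reasoning block closed and send it back as a prefilled
    # assistant message so the model writes the final answer on top of it.
    prefill = "<think>\n" + thinking + "\n</think>\n\n"
    resume = client.chat.completions.create(
        model="qwen/qwq-32b",
        messages=messages + [{"role": "assistant", "content": prefill}],
    )
    return prefill + resume.choices[0].message.content
```

Note that the prefilled content has to include the opening `<think>` itself, because with the modified template the automatic `<|im_start|>assistant\n<think>\n` suffix is skipped whenever the last message is from the assistant.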

I've built an example proxy that leverages this prefilling technique to expose a max_thinking_chars parameter. It currently works with the template I shared and Qwen/QwQ-32B via LM Studio.
https://github.com/tomasmcm/dttm-proxy
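A request through such a proxy might look roughly like the following; the port, path, and response shape here are assumptions (an OpenAI-compatible endpoint), so check the repo for the actual interface:

```python
import requests

# Hypothetical request: a standard chat completions call with the extra
# max_thinking_chars field understood by the proxy.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen/qwq-32b",
        "messages": [{"role": "user", "content": "How many r's are in strawberry?"}],
        "max_thinking_chars": 1500,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```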
