Add chat template (#27)
- Update README (75452cee6bfeadf3595db327ad642f058c3264f0)
- Add chat template (69a339b1e8b0bad0db1bb714bfdb851854a3fff8)
- README.md +14 -1
- tokenizer_config.json +1 -0
README.md
CHANGED
@@ -107,6 +107,19 @@ assert tokens == [1, 7596, 1247, 28747, 26256, 2936, 7653, 1413, 334, 1680, 3200
 
 </details>
 
+The GPT4 template is also available as the integrated `tokenizer.chat_template`,
+which can be used instead of manually specifying the template:
+
+```python
+messages = [
+    {"role": "user", "content": "Hello"},
+    {"role": "assistant", "content": "Hi"},
+    {"role": "user", "content": "How are you today?"}
+]
+tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
+assert tokens == [1, 420, 6316, 28781, 3198, 3123, 1247, 28747, 22557, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747, 15359, 32000, 420, 6316, 28781, 3198, 3123, 1247, 28747, 1602, 460, 368, 3154, 28804, 32000, 420, 6316, 28781, 3198, 3123, 21631, 28747]
+```
+
 ## Comparison with [X.AI Grok models](https://x.ai/)
 
 Hey @elonmusk, I just wanted to let you know that I've recently come across your new model, Grok, and I must say, I'm quite impressed! With 33 billion parameters and all, you've really outdone yourself. But, I've got some news for you - I've outperformed Grok with my humble 7 billion parameters! Isn't that wild? I mean, who would have thought that a model with fewer parameters could be just as witty and humorous as Grok?
@@ -190,4 +203,4 @@ We extend our heartfelt gratitude to AutoMeta and caesus from Alignment Lab AI,
 
 Special thanks go to Changling Liu from GPT Desk Pte. Ltd., Qiying Yu at Tsinghua University, Baochang Ma, and Hao Wan from 01.AI company for their generous provision of resources. We are also deeply grateful to Jianxiong Li and Peng Li at Tsinghua University for their insightful discussions.
 
-Furthermore, we appreciate the developers behind the following projects for their significant contributions to our research: [Mistral](https://mistral.ai/), [Chain-of-Thought Hub](https://github.com/FranxYao/chain-of-thought-hub), [Llama 2](https://ai.meta.com/llama/), [Self-Instruct](https://arxiv.org/abs/2212.10560), [FastChat (Vicuna)](https://github.com/lm-sys/FastChat), [Alpaca](https://github.com/tatsu-lab/stanford_alpaca.git), and [StarCoder](https://github.com/bigcode-project/starcoder). Their work has been instrumental in driving our research forward.
+Furthermore, we appreciate the developers behind the following projects for their significant contributions to our research: [Mistral](https://mistral.ai/), [Chain-of-Thought Hub](https://github.com/FranxYao/chain-of-thought-hub), [Llama 2](https://ai.meta.com/llama/), [Self-Instruct](https://arxiv.org/abs/2212.10560), [FastChat (Vicuna)](https://github.com/lm-sys/FastChat), [Alpaca](https://github.com/tatsu-lab/stanford_alpaca.git), and [StarCoder](https://github.com/bigcode-project/starcoder). Their work has been instrumental in driving our research forward.
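
For quick local verification of the new template, here is a minimal sketch built on the `transformers` `apply_chat_template` API used in the README snippet above. It is not part of this commit: the repository id is a placeholder (this diff does not name the Hub path), and the exact token ids depend on the tokenizer files shipped in this repo.

```python
from transformers import AutoTokenizer

# Placeholder repo id: substitute this model's actual Hub path.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you today?"},
]

# tokenize=False returns the rendered prompt string instead of token ids,
# which makes the "GPT4 Correct ..." turn markers easy to eyeball.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)

# With tokenize=True (the default), the same call returns the token ids
# shown in the README's assert statement.
tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(tokens)
```
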
tokenizer_config.json
CHANGED
@@ -48,6 +48,7 @@
     "<|pad_0|>"
   ],
   "bos_token": "<s>",
+  "chat_template": "{{ bos_token }}{% for message in messages %}{{ 'GPT4 Correct ' + message['role'].title() + ': ' + message['content'] + '<|end_of_turn|>'}}{% endfor %}{% if add_generation_prompt %}{{ 'GPT4 Correct Assistant:' }}{% endif %}",
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|end_of_turn|>",
   "legacy": true,
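
To make the one-line `chat_template` addition easier to review, here is a standalone sketch that renders the same Jinja string with `jinja2`. Note that `transformers` applies the template through its own Jinja environment, so this is only an approximate preview of the prompt the tokenizer will build.

```python
from jinja2 import Template

# The chat_template string added in this commit, copied from tokenizer_config.json.
chat_template = (
    "{{ bos_token }}{% for message in messages %}"
    "{{ 'GPT4 Correct ' + message['role'].title() + ': ' + message['content'] + '<|end_of_turn|>'}}"
    "{% endfor %}{% if add_generation_prompt %}{{ 'GPT4 Correct Assistant:' }}{% endif %}"
)

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you today?"},
]

# Render with the special tokens from this config file.
rendered = Template(chat_template).render(
    messages=messages,
    bos_token="<s>",
    add_generation_prompt=True,
)
print(rendered)
# Expected output (one continuous line, wrapped here for readability):
#   <s>GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>
#   GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:
```
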