Why should you use this instead of the tiktoken tokenizer included with the original model?

  1. This tokenizer is validated against https://huggingface.co./datasets/xn (all languages) to be encode/decode compatible with the dbrx-base tiktoken tokenizer.
  2. The original tokenizer pads the vocabulary to the correct size with `<extra_N>` tokens, but the encoder never uses them.
  3. The original tokenizer uses EOS as the pad token, which can lead trainers to mask out the EOS token so that the model never learns to output EOS (see the sketch after this list).
  4. [NOT FIXED: INVESTIGATING] The `config.json` embedding size (`"vocab_size": 100352`) does not match the tokenizer's 100277 tokens.
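
A minimal sketch of issue 3, assuming the original tokenizer reuses EOS as the pad token (the EOS id below is illustrative only). The common collator pattern of ignoring pad positions in the loss then also ignores every genuine EOS position:

```python
import torch

eos_id = 100257      # assumed EOS id, for illustration only
pad_id = eos_id      # the original tokenizer reuses EOS as pad
input_ids = torch.tensor([[15339, 1917, eos_id, pad_id, pad_id]])  # "hello world" + EOS + padding

labels = input_ids.clone()
labels[labels == pad_id] = -100   # ignore padding in the loss...
print(labels)                     # ...but the real EOS at position 2 is masked too
# tensor([[15339,  1917,  -100,  -100,  -100]])
```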

Modified from the original code at https://huggingface.co./Xenova/dbrx-instruct-tokenizer

Changes:
1. Remove non-base-model tokens.
2. Keep/add the `<|pad|>` special token so that padding can be differentiated from EOS/BOS.
3. Expose 15 unused/reserved `<|extra_N|>` tokens for use.

# pad token
```json
"100256": {
  "content": "<|pad|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},
```
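
As a quick sanity check (a sketch; the repo id is a placeholder for wherever this tokenizer is hosted), the pad token should resolve to id 100256 and be distinct from EOS:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('path/to/this-tokenizer')  # placeholder repo id
assert tokenizer.pad_token == '<|pad|>'
assert tokenizer.pad_token_id == 100256
assert tokenizer.pad_token_id != tokenizer.eos_token_id  # the fix for issue 3 above
```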

# 15 unused/reserved extra tokens
```json
"<|extra_0|>": 100261
"<|extra_1|>": 100262
...
"<|extra_14|>": 100275
```

DBRX Instruct Tokenizer

A 🤗-compatible version of the DBRX Instruct tokenizer (adapted from databricks/dbrx-instruct). This means it can be used with Hugging Face libraries, including Transformers, Tokenizers, and Transformers.js.

Example usage:

Transformers/Tokenizers

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/dbrx-instruct-tokenizer')
assert tokenizer.encode('hello world') == [15339, 1917]
```
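
Decoding the same ids round-trips back to the input string, which is the encode/decode compatibility property described in point 1 above:

```python
assert tokenizer.decode([15339, 1917]) == 'hello world'
```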

Transformers.js

```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/dbrx-instruct-tokenizer');
const tokens = tokenizer.encode('hello world'); // [15339, 1917]
```