metadata

library_name: transformers
tags:
  - tokenizer
license: apache-2.0

Trained on 20x More Tokens than Previous Iterations

Byte Fallback BPE Tokenizer

Trained using google/sentencepiece (SPM)
Vocab Size: 72808

Training Args



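import sentencepiece as spm
from datasets import load_dataset

# Placeholder values: the card does not state the model prefix or thread count.
tokenizer_name = "tokenizer"
num_threads = 8
vocab_size = 72808  # vocab size reported above

def get_corpus_iterator():
    """Yield non-empty training texts in chunks of at most 8192 characters."""
    dataset = load_dataset("fhai50032/pds-tk-specific-2", split="train")
    shuffled = dataset.shuffle(seed=42)
    for text in shuffled["text"]:
        stripped = text.strip()
        if stripped:
            # Chunk size mirrors max_sentence_length below.
            for i in range(0, len(stripped), 8192):
                yield stripped[i : i + 8192]

# Assumed call: the card lists only the keyword arguments,
# which match those of SentencePieceTrainer.train.
spm.SentencePieceTrainer.train(
    sentence_iterator=get_corpus_iterator(),
    model_prefix=tokenizer_name,
    vocab_size=vocab_size,
    num_threads=num_threads,
    model_type="bpe",
    max_sentence_length=8192,
    character_coverage=1.0,
    byte_fallback=True,
    shuffle_input_sentence=True,
    remove_extra_whitespaces=False,
    normalization_rule_name="identity",
)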
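The chunking step in the corpus iterator can be checked in isolation. The sketch below is a stand-alone version of that logic; `chunk_text` is a hypothetical helper name, not part of the training script. It also prints chunk lengths in UTF-8 bytes, because SentencePiece documents `max_sentence_length` in bytes while the iterator slices by characters, so a chunk of Devanagari text can be larger in bytes than its character count suggests.

```python
def chunk_text(text: str, size: int = 8192):
    """Yield consecutive pieces of the stripped text, each at most `size` characters."""
    stripped = text.strip()
    # range(0, 0, size) is empty, so whitespace-only input yields nothing.
    for i in range(0, len(stripped), size):
        yield stripped[i : i + size]


# Ten Devanagari letters padded with whitespace, chunked with a small size for display.
sample = "  " + "क" * 10 + "  "
chunks = list(chunk_text(sample, size=4))

char_lengths = [len(c) for c in chunks]
byte_lengths = [len(c.encode("utf-8")) for c in chunks]

print(char_lengths)  # lengths in characters
print(byte_lengths)  # lengths in UTF-8 bytes
```

Each Devanagari letter takes three bytes in UTF-8, so the byte lengths here are three times the character lengths.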

Special Tokens

{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}

Training Composition:

Maths: 550 M (aluncstokes/mathpile_arxiv_subset)
Code: 800 M (codeparrot/github-code)
Hinglish: 250 M (Abhishekcr448/Hinglish-Everyday-Conversations-1M, Maihar/hinglish-80k)
English: 2,000 M (allenai/c4, config "en")
Hindi: 2,200 M (aloobun/dhpileIN, data_dir='hi')
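To see the mix at a glance, the amounts above can be turned into percentage shares. This is only arithmetic on the listed figures; the card gives the amounts in millions without naming the unit, and the dictionary name below is illustrative.

```python
# Amounts in millions, copied from the composition list above (unit as given in the card).
composition_millions = {
    "Maths": 550,
    "Code": 800,
    "Hinglish": 250,
    "English": 2000,
    "Hindi": 2200,
}

total = sum(composition_millions.values())

# Share of each category as a percentage of the total, rounded to one decimal place.
shares = {
    name: round(100 * amount / total, 1)
    for name, amount in composition_millions.items()
}

print(f"total: {total} M")
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share}%")
```

Hindi and English together make up roughly 72% of the listed total.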

Evals

Tokenization Efficiency (Lower is Better)

	Tokenizer	English	Hindi	Tamil	Bengali	Malayalam	Telugu	Gujarati	Punjabi	Code_Python	Code_Java	Code_C++	Math
0	deepseek-ai/DeepSeek-R1 (128k)	338874	22855	48957	39617	73928	40345	101020	79172	5231	2224	7055	5376
1	unsloth/phi-4 (100k)	308645	40456	59750	116122	149889	48689	118335	87413	4809	2110	6529	5573
2	deepseek-ai/DeepSeek-R1-Distill-Llama-8B (128k)	308512	21110	59625	115138	149883	48661	118061	86765	4809	2111	6530	5574
3	unsloth/gemma-2-9b-it (256k)	323335	15916	53913	53402	57219	47610	107925	87222	5948	2569	8639	5871
4	Ornaments/72k-Bilingual-BBPE-TK-SPM (72k) (Old)	366710	11447	61408	94191	97207	50229	117874	90045	8201	4000	13706	5585
>	Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity (72k)	330830	10318	59089	93740	92655	44975	109411	87922	7819	3743	12953	5253
6	Ornaments/72k-TK-BBPE-HF (72k)	321274	10813	67585	159985	193813	55654	134397	97063	5225	2263	7090	5150
7	nvidia/Nemotron-4-Mini-Hindi-4B-Instruct (256k)	332271	14327	55473	36615	45783	48270	160115	117174	6186	2732	8861	6136
8	sarvamai/OpenHathi-7B-Hi-v0.1-Base (48k)	370133	15633	67845	120340	105953	68315	159122	113817	6595	2792	9233	6223
9	sarvamai/sarvam-1 (68k)	385386	11257	61396	27348	31822	51463	119666	103344	7331	3068	9724	6864
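The gap between two rows is easier to read as a relative change. The sketch below takes four columns from row 4 (Old) and the row marked `>` and computes the percentage reduction in token count. The numbers are copied from the table; the variable names are illustrative.

```python
# Token counts copied from the table: row 4 (Old) and the row marked ">".
old_counts = {"English": 366710, "Hindi": 11447, "Code_Python": 8201, "Math": 5585}
new_counts = {"English": 330830, "Hindi": 10318, "Code_Python": 7819, "Math": 5253}

# Percentage reduction in token count; a positive value means fewer tokens than before.
reduction_pct = {
    column: 100 * (old_counts[column] - new_counts[column]) / old_counts[column]
    for column in old_counts
}

for column, pct in reduction_pct.items():
    print(f"{column}: {pct:.1f}% fewer tokens")
```

The same calculation applies to any pair of rows, as long as both were measured on the same evaluation text.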

Encode-Decode

Hindi

Input  : ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
Tokens: ['▁ऋतुराज', '▁गायकवाड़', '▁(', 'क', 'प्तान', '),', '▁डे', 'व', 'ोन', '▁कॉ', 'न', 'वे', ',', '▁रच', 'िन', '▁रविंद्र', ',', '▁राहुल', '▁त्रिपाठी', ',', '▁शिवम', '▁दुबे', ',', '▁रविंद्र', '▁जडेजा', ',', '▁एमएस', '▁धोनी', '▁(', 'व', 'िकेट', 'कीपर', '),', '▁आर', '▁अश्विन', ',', '▁मी', 'था', 'शा', '▁पथ', 'िर', 'ाना', ',', '▁ख', 'लील', '▁अहमद', ',', '▁नूर', '▁अहमद', '।']
Encoded: [1, 34862, 26967, 435, 61725, 29148, 1099, 1945, 61754, 1769, 1777, 61735, 981, 61750, 18114, 465, 19049, 61750, 4310, 11042, 61750, 21132, 13133, 61750, 19049, 13624, 61750, 18436, 12473, 435, 61754, 1956, 14572, 1099, 2208, 17618, 61750, 3352, 2063, 2500, 12020, 731, 781, 61750, 429, 13121, 9490, 61750, 26786, 9490, 61770]
Len Tokens 51
Decoded: <s> ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।

English

Input  : Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
Tokens: ['▁Bangalore', '▁and', '▁Chennai', '▁have', '▁faced', '▁each', '▁other', '▁in', '▁33', '▁matches', '▁in', '▁IPL', '.', '▁Out', '▁of', '▁these', '▁33', '▁games', ',', '▁Bangalore', '▁have', '▁won', '▁11', '▁whereas', '▁Chennai', '▁have', '▁come', '▁out', '▁vict', 'orious', '▁on', '▁21', '▁occasion', '.', '▁1', '▁match', '▁ended', '▁without', '▁a', '▁result', '.']
Encoded: [1, 43579, 317, 42140, 607, 21626, 1872, 1022, 313, 7736, 14838, 313, 9863, 61751, 7363, 319, 1517, 7736, 4837, 61750, 43579, 607, 4817, 1730, 22734, 42140, 607, 2968, 811, 9594, 30189, 395, 3209, 13423, 61751, 385, 5083, 13623, 2675, 262, 1773, 61751]
Len Tokens 42
Decoded: <s> Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
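The relation between the token pieces and the decoded text can be sketched without the tokenizer itself. The snippet below is a simplified model of SentencePiece-style detokenization, not the library's implementation: it joins the pieces, turns the "▁" marker (U+2581) back into spaces, and drops the single leading space that the marker on the first piece produces. `pieces_to_text` is a hypothetical helper name, and byte-fallback pieces such as `<0xF0>` are not handled here.

```python
WORD_BOUNDARY = "\u2581"  # "▁", the marker SentencePiece uses for a space


def pieces_to_text(pieces):
    """Join pieces and map the word-boundary marker back to spaces."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # The first piece carries a marker even though the input had no leading space.
    return text[1:] if text.startswith(" ") else text


# First sentence of the English example above.
pieces = ['▁Bangalore', '▁and', '▁Chennai', '▁have', '▁faced', '▁each',
          '▁other', '▁in', '▁33', '▁matches', '▁in', '▁IPL', '.']

text = pieces_to_text(pieces)
print(text)
```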

Math

Input  : % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi

Tokens: ['▁%', '▁Change', '▁the', '▁font', '▁if', '▁you', '▁want', '▁to', ',', '▁depending', '▁on', '▁whether', '\n', '%', '▁you', "'", 're', '▁using', '▁pd', 'fl', 'ate', 'x', '▁or', '▁x', 'el', 'ate', 'x', '/', 'l', 'ual', 'ate', 'x', '\n', '%', '▁WH', 'EN', '▁COMP', 'IL', 'ING', '▁WITH', '▁X', 'EL', 'ATE', 'X', '▁PLEASE', '▁USE', '\n', '%', '▁x', 'el', 'ate', 'x', '▁-', 'shell', '-', 'escape', '▁-', 'output', '-', 'driver', '="', 'xd', 'v', 'ip', 'df', 'mx', '▁-', 'z', '▁0', '"', '▁sample', '.', 'tex', '\n\\', 'ift', 'ut', 'ex', '\n', '▁', '▁%', '▁If', '▁using', '▁x', 'el', 'ate', 'x', '▁or', '▁l', 'ual', 'ate', 'x', ':\n', '▁', '▁\\', 'set', 'main', 'font', '{', 'Rob', 'oto', '▁Sl', 'ab', '}\n', '▁', '▁\\', 'sets', 'ans', 'font', '{', 'L', 'ato', '}\n', '▁', '▁\\', 'renew', 'command', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}\n\\', 'else', '\n', '▁', '▁%', '▁If', '▁using', '▁pd', 'fl', 'ate', 'x', ':\n', '▁', '▁\\', 'us', 'ep', 'ack', 'age', '[', 'rm', ']{', 'rob', 'oto', '}\n', '▁', '▁\\', 'us', 'ep', 'ack', 'age', '[', 'defaults', 'ans', ']{', 'l', 'ato', '}\n', '▁', '▁%', '▁\\', 'us', 'ep', 'ack', 'age', '{', 'sources', 'ans', 'pro', '}\n', '▁', '▁\\', 'renew', 'command', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}\n\\', 'fi', '\n']
Encoded: [1, 2920, 20717, 273, 9731, 686, 380, 1570, 306, 61750, 11224, 395, 4910, 61755, 61863, 380, 61809, 265, 2138, 32887, 2673, 442, 61792, 506, 1894, 335, 442, 61792, 61804, 61729, 1204, 442, 61792, 61755, 61863, 19877, 1920, 27670, 4809, 3922, 25404, 2470, 8534, 6586, 61859, 61046, 22326, 61755, 61863, 1894, 335, 442, 61792, 777, 34320, 61780, 35727, 777, 9020, 61780, 16819, 696, 25014, 61762, 947, 7497, 35801, 777, 61831, 612, 61798, 10079, 61751, 8032, 1207, 2865, 388, 1096, 61755, 61715, 2920, 1608, 2138, 1894, 335, 442, 61792, 506, 334, 1204, 442, 61792, 1025, 61715, 426, 1106, 6972, 5295, 61782, 32606, 5896, 11751, 403, 499, 61715, 426, 8105, 770, 5295, 61782, 61811, 10464, 499, 61715, 426, 48033, 10843, 557, 16861, 6694, 4661, 5157, 6694, 1495, 7019, 61755, 61715, 2920, 1608, 2138, 32887, 2673, 442, 61792, 1025, 61715, 426, 379, 953, 697, 626, 61846, 1628, 6219, 35451, 5896, 499, 61715, 426, 379, 953, 697, 626, 61846, 41970, 770, 6219, 61729, 10464, 499, 61715, 2920, 426, 379, 953, 697, 626, 61782, 43733, 770, 1194, 499, 61715, 426, 48033, 10843, 557, 16861, 6694, 4661, 5157, 6694, 1495, 12885, 61755]
Len Tokens 185
Decoded: <s> % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi

Code

Input  : class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
Tokens: ['▁class', '▁Sentence', 'P', 'iece', 'Un', 'ig', 'ram', 'Token', 'izer', '(', 'Base', 'Token', 'izer', '):\n', '▁', '▁', '▁', '▁"""', 'Sentence', 'P', 'iece', '▁Un', 'ig', 'ram', '▁Token', 'izer', '\n\n', '▁', '▁', '▁', '▁Rep', 'resents', '▁the', '▁Un', 'ig', 'ram', '▁algorithm', ',', '▁with', '▁the', '▁pre', 'token', 'ization', '▁used', '▁by', '▁Sentence', 'P', 'iece', '\n', '▁', '▁', '▁', '▁"""\n\n', '▁', '▁', '▁', '▁def', '▁__', 'init', '__', '(\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁self', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁voc', 'ab', ':', '▁Optional', '[', 'List', '[', 'Tuple', '[', 'str', ',', '▁float', ']]', ']', '▁=', '▁None', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁replacement', ':', '▁str', '▁=', '▁"', '▁"', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁add', '_', 'prefix', '_', 'space', ':', '▁bool', '▁=', '▁True', ',\n', '▁', '▁', '▁', '▁):\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁if', '▁voc', 'ab', '▁is', '▁not', '▁None', ':\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁#', '▁Let', '▁Un', 'ig', 'ram', '(', '..', ')', '▁fail', '▁if', '▁only', '▁one', '▁of', '▁them', '▁is', '▁None', '\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁token', 'izer', '▁=', '▁Token', 'izer', '(', 'Un', 'ig', 'ram', '(', 'voc', 'ab', '))\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁else', ':\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁token', 'izer', '▁=', '▁Token', 'izer', '(', 'Un', 'ig', 'ram', '())']
Encoded: [1, 946, 40517, 61803, 15343, 4952, 367, 1163, 12331, 7159, 61776, 9859, 12331, 7159, 2454, 61715, 61715, 61715, 7606, 59192, 61803, 15343, 2426, 367, 1163, 35304, 7159, 962, 61715, 61715, 61715, 4784, 13312, 273, 2426, 367, 1163, 10857, 61750, 437, 273, 1184, 13584, 2854, 1815, 597, 40517, 61803, 15343, 61755, 61715, 61715, 61715, 23853, 61715, 61715, 61715, 1178, 3771, 2999, 1390, 3488, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1349, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 26775, 403, 61799, 21329, 61846, 3412, 61846, 39253, 61846, 2572, 61750, 10030, 17241, 61847, 440, 4673, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 13564, 61799, 1944, 440, 591, 591, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1173, 61768, 17279, 61768, 5529, 61799, 7165, 440, 8620, 622, 61715, 61715, 61715, 39708, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 686, 26775, 403, 366, 618, 4673, 1025, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1478, 4593, 2426, 367, 1163, 61776, 786, 61775, 6272, 686, 1323, 882, 319, 1136, 366, 4673, 61755, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 14205, 7159, 440, 35304, 7159, 61776, 4952, 367, 1163, 61776, 44969, 403, 3630, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 2335, 1025, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 14205, 7159, 440, 35304, 7159, 61776, 4952, 367, 1163, 8170]
Len Tokens 226
Decoded: <s> class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = " ",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
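In this example the input line `replacement: str = "▁",` comes back as `replacement: str = " ",`. A likely reason is that "▁" (U+2581) is the same character SentencePiece uses internally to mark spaces, so a literal "▁" in the input cannot be told apart from a space after encoding. The sketch below models only that substitution step; the two helper functions are hypothetical and are not SentencePiece APIs.

```python
WORD_BOUNDARY = "\u2581"  # "▁"


def escape_spaces(text: str) -> str:
    """Replace spaces with the word-boundary marker, as done before tokenization."""
    return text.replace(" ", WORD_BOUNDARY)


def unescape_spaces(text: str) -> str:
    """Map every word-boundary marker back to a space, as done when decoding."""
    return text.replace(WORD_BOUNDARY, " ")


original = 'replacement: str = "\u2581",'
round_tripped = unescape_spaces(escape_spaces(original))

print(original)
print(round_tripped)
print(original == round_tripped)
```

Text that contains a literal "▁" is therefore not expected to round-trip exactly, while the rest of the snippet is reproduced unchanged.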

Emoji

Input  : 😜🫤☹️😖🤢🤮😇🐻‍❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑‍🦲👨‍🚒👨‍🚀
Tokens: ['▁', '😜', '<0xF0>', '<0x9F>', '<0xAB>', '<0xA4>', '☹', '️', '😖', '🤢', '🤮', '😇', '<0xF0>', '<0x9F>', '<0x90>', '<0xBB>', '\u200d', '❄', '️', '🦄', '🐾', '<0xF0>', '<0x9F>', '<0x90>', '<0xBD>', '🐍', '<0xF0>', '<0x9F>', '<0xA6>', '<0x9E>', '<0xF0>', '<0x9F>', '<0xA6>', '<0x90>', '<0xF0>', '<0x9F>', '<0xA6>', '<0xBF>', '<0xF0>', '<0x9F>', '<0xA4>', '<0xB4>', '<0xF0>', '<0x9F>', '<0xA7>', '<0x91>', '\u200d', '<0xF0>', '<0x9F>', '<0xA6>', '<0xB2>', '👨', '\u200d', '<0xF0>', '<0x9F>', '<0x9A>', '<0x92>', '👨', '\u200d', '🚀']
Encoded: [1, 61715, 64694, 243, 162, 174, 167, 66250, 62096, 68719, 68725, 70665, 68209, 243, 162, 147, 190, 62658, 66107, 62096, 70672, 69452, 243, 162, 147, 192, 66921, 243, 162, 169, 161, 243, 162, 169, 147, 243, 162, 169, 194, 243, 162, 167, 183, 243, 162, 170, 148, 62658, 243, 162, 169, 181, 66362, 62658, 243, 162, 157, 149, 66362, 62658, 62748]
Len Tokens 61
Decoded: <s> 😜🫤☹️😖🤢🤮😇🐻‍❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑‍🦲👨‍🚒👨‍🚀
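Emoji that are missing from the vocabulary appear above as runs of `<0x..>` pieces. That is byte fallback: the character is split into its UTF-8 bytes, and each byte has its own token. The sketch below reproduces the four pieces shown for 🫤 and decodes them back. The id offset of 3 is inferred from the ids in this example (`<0xF0>`, byte 240, has id 243), which is consistent with three special tokens preceding the byte tokens; treat it as an observation about this output, not a documented guarantee. The helper names are hypothetical.

```python
BYTE_ID_OFFSET = 3  # inferred from the example above: byte 0xF0 (240) maps to id 243


def byte_fallback_pieces(char: str):
    """Split a character into one <0xNN> piece per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


def pieces_to_char(pieces):
    """Rebuild the character from its <0xNN> pieces."""
    return bytes(int(p[3:-1], 16) for p in pieces).decode("utf-8")


emoji = "\U0001FAE4"  # 🫤, second character of the emoji input above
pieces = byte_fallback_pieces(emoji)
ids = [int(p[3:-1], 16) + BYTE_ID_OFFSET for p in pieces]

print(pieces)
print(ids)
print(pieces_to_char(pieces) == emoji)
```

Because every byte value has a token, any input can be encoded without producing `<unk>`, at the cost of four tokens for each out-of-vocabulary emoji code point.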

Sanskrit

Input  : ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ. 
Tokens: ['▁ॐ', '▁त्र', '्यम', '्ब', 'क', 'ं', '▁य', 'जाम', 'हे', '▁सुग', 'न्', 'धि', 'ं', '▁पुष्ट', 'िव', 'र्', 'धन', 'म्', '▁उर्', 'वार', 'ुक', 'म', 'िव', '▁ब', 'न्', 'धन', 'ान', '्म', 'ृत', '्यो', 'र्म', 'ु', 'क्ष', 'ीय', '▁माम', 'ृता', 'त्', '▁ॐ', '.', '▁']
Encoded: [1, 29916, 5202, 1347, 9931, 61725, 61738, 411, 8036, 11065, 38930, 1063, 3255, 61738, 61196, 816, 326, 4408, 3315, 10236, 527, 1908, 61742, 816, 289, 1063, 4408, 325, 640, 1803, 4364, 911, 61763, 536, 2770, 940, 27338, 592, 29916, 61751, 61715]
Len Tokens 41
Decoded: <s> ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.