---
library_name: transformers
tags:
- tokenizer
license: apache-2.0
---
# Byte Fallback BPE Tokenizer

Trained on 20x more tokens than previous iterations.

- Trained using Google SentencePiece (google/sentencepiece)
- Vocab size: 72808
## Training Args
```python
import sentencepiece as spm
from datasets import load_dataset

tokenizer_name = "72k-Bilingual-BBPE-TK-SPM-Identity"  # output model prefix (placeholder name)
vocab_size = 72808
num_threads = 32  # placeholder; set to the number of available cores

def get_corpus_iterator():
    """Stream the training corpus, splitting long documents into 8192-character chunks."""
    dataset = load_dataset("fhai50032/pds-tk-specific-2", split="train")
    shuffled = dataset.shuffle(seed=42)
    for text in shuffled["text"]:
        stripped = text.strip()
        if stripped:
            for i in range(0, len(stripped), 8192):
                yield stripped[i : i + 8192]

spm.SentencePieceTrainer.train(
    sentence_iterator=get_corpus_iterator(),
    model_prefix=tokenizer_name,
    vocab_size=vocab_size,
    num_threads=num_threads,
    model_type="bpe",
    max_sentence_length=8192,
    character_coverage=1.0,
    byte_fallback=True,                    # unseen characters fall back to raw UTF-8 bytes
    shuffle_input_sentence=True,
    remove_extra_whitespaces=False,
    normalization_rule_name="identity",    # no NFKC normalization of the input text
)
```
## Special Tokens

`{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}`
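The tokenizer loads like any other SentencePiece-backed tokenizer in `transformers`. A minimal sketch, assuming the files are hosted under the `Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity` repo that appears in the evals below:

```python
from transformers import AutoTokenizer

# Repo id taken from the eval table below; adjust if the files live elsewhere.
tok = AutoTokenizer.from_pretrained("Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity")

print(tok.special_tokens_map)  # {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
print(len(tok))                # 72808 plus any added tokens

ids = tok("ऋतुराज गायकवाड़ (कप्तान)").input_ids
print(ids)
print(tok.decode(ids))
```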
## Training Composition

- Maths (550M tokens): `aluncstokes/mathpile_arxiv_subset`
- Code (800M tokens): `codeparrot/github-code`
- Hinglish (250M tokens): `Abhishekcr448/Hinglish-Everyday-Conversations-1M`, `Maihar/hinglish-80k`
- English (2,000M tokens): `allenai/c4` (config `"en"`)
- Hindi (2,200M tokens): `aloobun/dhpileIN` (`data_dir='hi'`)
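For reference, the individual components can be streamed straight from the Hub. This is only a hedged sketch: the text column names for every dataset other than `allenai/c4`, and the exact filtering and sampling used to build the `fhai50032/pds-tk-specific-2` mix referenced in the training script above, are assumptions.

```python
from datasets import load_dataset

def stream_text(name, config=None, text_key="text", **kwargs):
    """Stream one corpus component and yield its raw text field."""
    # Some script-based datasets may additionally need trust_remote_code=True.
    ds = load_dataset(name, config, split="train", streaming=True, **kwargs)
    for row in ds:
        yield row[text_key]

# Components as listed above; text_key values other than c4's "text" are assumptions.
components = {
    "maths":    stream_text("aluncstokes/mathpile_arxiv_subset"),
    "code":     stream_text("codeparrot/github-code", text_key="code"),
    "hinglish": stream_text("Abhishekcr448/Hinglish-Everyday-Conversations-1M"),
    "english":  stream_text("allenai/c4", "en"),
    "hindi":    stream_text("aloobun/dhpileIN", data_dir="hi"),
}
```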
## Evals

### Tokenization Efficiency (lower is better)
| Tokenizer | English | Hindi | Tamil | Bengali | Malayalam | Telugu | Gujarati | Punjabi | Code_Python | Code_Java | C++ | Math |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 (128k) | 338874 | 22855 | 48957 | 39617 | 73928 | 40345 | 101020 | 79172 | 5231 | 2224 | 7055 | 5376 |
| unsloth/phi-4 (100k) | 308645 | 40456 | 59750 | 116122 | 149889 | 48689 | 118335 | 87413 | 4809 | 2110 | 6529 | 5573 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B (128k) | 308512 | 21110 | 59625 | 115138 | 149883 | 48661 | 118061 | 86765 | 4809 | 2111 | 6530 | 5574 |
| unsloth/gemma-2-9b-it (256k) | 323335 | 15916 | 53913 | 53402 | 57219 | 47610 | 107925 | 87222 | 5948 | 2569 | 8639 | 5871 |
| Ornaments/72k-Bilingual-BBPE-TK-SPM (72k) (Old) | 366710 | 11447 | 61408 | 94191 | 97207 | 50229 | 117874 | 90045 | 8201 | 4000 | 13706 | 5585 |
| **Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity (72k)** (this tokenizer) | 330830 | 10318 | 59089 | 93740 | 92655 | 44975 | 109411 | 87922 | 7819 | 3743 | 12953 | 5253 |
| Ornaments/72k-TK-BBPE-HF (72k) | 321274 | 10813 | 67585 | 159985 | 193813 | 55654 | 134397 | 97063 | 5225 | 2263 | 7090 | 5150 |
| nvidia/Nemotron-4-Mini-Hindi-4B-Instruct (256k) | 332271 | 14327 | 55473 | 36615 | 45783 | 48270 | 160115 | 117174 | 6186 | 2732 | 8861 | 6136 |
| sarvamai/OpenHathi-7B-Hi-v0.1-Base (48k) | 370133 | 15633 | 67845 | 120340 | 105953 | 68315 | 159122 | 113817 | 6595 | 2792 | 9233 | 6223 |
| sarvamai/sarvam-1 (68k) | 385386 | 11257 | 61396 | 27348 | 31822 | 51463 | 119666 | 103344 | 7331 | 3068 | 9724 | 6864 |
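Lower numbers mean fewer tokens for the same sample text. A rough sketch of how such a comparison can be reproduced with `transformers` (the evaluation texts themselves are not included in this card, so `samples` below is only a placeholder):

```python
from transformers import AutoTokenizer

# Repo ids taken from the table above; extend the list as needed.
tokenizers = [
    "Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity",
    "deepseek-ai/DeepSeek-R1",
    "sarvamai/sarvam-1",
]

# Placeholder: map of domain -> evaluation text (the actual eval texts are not published here).
samples = {"English": "...", "Hindi": "...", "Code_Python": "..."}

for name in tokenizers:
    tok = AutoTokenizer.from_pretrained(name)
    counts = {
        domain: len(tok.encode(text, add_special_tokens=False))
        for domain, text in samples.items()
    }
    print(name, counts)
```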
## Encode-Decode
- Hindi
Input : ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
Tokens: ['▁ऋतुराज', '▁गायकवाड़', '▁(', 'क', 'प्तान', '),', '▁डे', 'व', 'ोन', '▁कॉ', 'न', 'वे', ',', '▁रच', 'िन', '▁रविंद्र', ',', '▁राहुल', '▁त्रिपाठी', ',', '▁शिवम', '▁दुबे', ',', '▁रविंद्र', '▁जडेजा', ',', '▁एमएस', '▁धोनी', '▁(', 'व', 'िकेट', 'कीपर', '),', '▁आर', '▁अश्विन', ',', '▁मी', 'था', 'शा', '▁पथ', 'िर', 'ाना', ',', '▁ख', 'लील', '▁अहमद', ',', '▁नूर', '▁अहमद', '।']
Encoded: [1, 34862, 26967, 435, 61725, 29148, 1099, 1945, 61754, 1769, 1777, 61735, 981, 61750, 18114, 465, 19049, 61750, 4310, 11042, 61750, 21132, 13133, 61750, 19049, 13624, 61750, 18436, 12473, 435, 61754, 1956, 14572, 1099, 2208, 17618, 61750, 3352, 2063, 2500, 12020, 731, 781, 61750, 429, 13121, 9490, 61750, 26786, 9490, 61770]
Len Tokens 51
Decoded: <s> ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
- English
Input : Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
Tokens: ['▁Bangalore', '▁and', '▁Chennai', '▁have', '▁faced', '▁each', '▁other', '▁in', '▁33', '▁matches', '▁in', '▁IPL', '.', '▁Out', '▁of', '▁these', '▁33', '▁games', ',', '▁Bangalore', '▁have', '▁won', '▁11', '▁whereas', '▁Chennai', '▁have', '▁come', '▁out', '▁vict', 'orious', '▁on', '▁21', '▁occasion', '.', '▁1', '▁match', '▁ended', '▁without', '▁a', '▁result', '.']
Encoded: [1, 43579, 317, 42140, 607, 21626, 1872, 1022, 313, 7736, 14838, 313, 9863, 61751, 7363, 319, 1517, 7736, 4837, 61750, 43579, 607, 4817, 1730, 22734, 42140, 607, 2968, 811, 9594, 30189, 395, 3209, 13423, 61751, 385, 5083, 13623, 2675, 262, 1773, 61751]
Len Tokens 42
Decoded: <s> Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
- Math
Input : % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
% If using xelatex or lualatex:
\setmainfont{Roboto Slab}
\setsansfont{Lato}
\renewcommand{\familydefault}{\sfdefault}
\else
% If using pdflatex:
\usepackage[rm]{roboto}
\usepackage[defaultsans]{lato}
% \usepackage{sourcesanspro}
\renewcommand{\familydefault}{\sfdefault}
\fi
Tokens: ['▁%', '▁Change', '▁the', '▁font', '▁if', '▁you', '▁want', '▁to', ',', '▁depending', '▁on', '▁whether', '\n', '%', '▁you', "'", 're', '▁using', '▁pd', 'fl', 'ate', 'x', '▁or', '▁x', 'el', 'ate', 'x', '/', 'l', 'ual', 'ate', 'x', '\n', '%', '▁WH', 'EN', '▁COMP', 'IL', 'ING', '▁WITH', '▁X', 'EL', 'ATE', 'X', '▁PLEASE', '▁USE', '\n', '%', '▁x', 'el', 'ate', 'x', '▁-', 'shell', '-', 'escape', '▁-', 'output', '-', 'driver', '="', 'xd', 'v', 'ip', 'df', 'mx', '▁-', 'z', '▁0', '"', '▁sample', '.', 'tex', '\n\\', 'ift', 'ut', 'ex', '\n', '▁', '▁%', '▁If', '▁using', '▁x', 'el', 'ate', 'x', '▁or', '▁l', 'ual', 'ate', 'x', ':\n', '▁', '▁\\', 'set', 'main', 'font', '{', 'Rob', 'oto', '▁Sl', 'ab', '}\n', '▁', '▁\\', 'sets', 'ans', 'font', '{', 'L', 'ato', '}\n', '▁', '▁\\', 'renew', 'command', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}\n\\', 'else', '\n', '▁', '▁%', '▁If', '▁using', '▁pd', 'fl', 'ate', 'x', ':\n', '▁', '▁\\', 'us', 'ep', 'ack', 'age', '[', 'rm', ']{', 'rob', 'oto', '}\n', '▁', '▁\\', 'us', 'ep', 'ack', 'age', '[', 'defaults', 'ans', ']{', 'l', 'ato', '}\n', '▁', '▁%', '▁\\', 'us', 'ep', 'ack', 'age', '{', 'sources', 'ans', 'pro', '}\n', '▁', '▁\\', 'renew', 'command', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}\n\\', 'fi', '\n']
Encoded: [1, 2920, 20717, 273, 9731, 686, 380, 1570, 306, 61750, 11224, 395, 4910, 61755, 61863, 380, 61809, 265, 2138, 32887, 2673, 442, 61792, 506, 1894, 335, 442, 61792, 61804, 61729, 1204, 442, 61792, 61755, 61863, 19877, 1920, 27670, 4809, 3922, 25404, 2470, 8534, 6586, 61859, 61046, 22326, 61755, 61863, 1894, 335, 442, 61792, 777, 34320, 61780, 35727, 777, 9020, 61780, 16819, 696, 25014, 61762, 947, 7497, 35801, 777, 61831, 612, 61798, 10079, 61751, 8032, 1207, 2865, 388, 1096, 61755, 61715, 2920, 1608, 2138, 1894, 335, 442, 61792, 506, 334, 1204, 442, 61792, 1025, 61715, 426, 1106, 6972, 5295, 61782, 32606, 5896, 11751, 403, 499, 61715, 426, 8105, 770, 5295, 61782, 61811, 10464, 499, 61715, 426, 48033, 10843, 557, 16861, 6694, 4661, 5157, 6694, 1495, 7019, 61755, 61715, 2920, 1608, 2138, 32887, 2673, 442, 61792, 1025, 61715, 426, 379, 953, 697, 626, 61846, 1628, 6219, 35451, 5896, 499, 61715, 426, 379, 953, 697, 626, 61846, 41970, 770, 6219, 61729, 10464, 499, 61715, 2920, 426, 379, 953, 697, 626, 61782, 43733, 770, 1194, 499, 61715, 426, 48033, 10843, 557, 16861, 6694, 4661, 5157, 6694, 1495, 12885, 61755]
Len Tokens 185
Decoded: <s> % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
% If using xelatex or lualatex:
\setmainfont{Roboto Slab}
\setsansfont{Lato}
\renewcommand{\familydefault}{\sfdefault}
\else
% If using pdflatex:
\usepackage[rm]{roboto}
\usepackage[defaultsans]{lato}
% \usepackage{sourcesanspro}
\renewcommand{\familydefault}{\sfdefault}
\fi
- Code
Input : class SentencePieceUnigramTokenizer(BaseTokenizer):
"""SentencePiece Unigram Tokenizer
Represents the Unigram algorithm, with the pretokenization used by SentencePiece
"""
def __init__(
self,
vocab: Optional[List[Tuple[str, float]]] = None,
replacement: str = "▁",
add_prefix_space: bool = True,
):
if vocab is not None:
# Let Unigram(..) fail if only one of them is None
tokenizer = Tokenizer(Unigram(vocab))
else:
tokenizer = Tokenizer(Unigram())
Tokens: ['▁class', '▁Sentence', 'P', 'iece', 'Un', 'ig', 'ram', 'Token', 'izer', '(', 'Base', 'Token', 'izer', '):\n', '▁', '▁', '▁', '▁"""', 'Sentence', 'P', 'iece', '▁Un', 'ig', 'ram', '▁Token', 'izer', '\n\n', '▁', '▁', '▁', '▁Rep', 'resents', '▁the', '▁Un', 'ig', 'ram', '▁algorithm', ',', '▁with', '▁the', '▁pre', 'token', 'ization', '▁used', '▁by', '▁Sentence', 'P', 'iece', '\n', '▁', '▁', '▁', '▁"""\n\n', '▁', '▁', '▁', '▁def', '▁__', 'init', '__', '(\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁self', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁voc', 'ab', ':', '▁Optional', '[', 'List', '[', 'Tuple', '[', 'str', ',', '▁float', ']]', ']', '▁=', '▁None', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁replacement', ':', '▁str', '▁=', '▁"', '▁"', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁add', '_', 'prefix', '_', 'space', ':', '▁bool', '▁=', '▁True', ',\n', '▁', '▁', '▁', '▁):\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁if', '▁voc', 'ab', '▁is', '▁not', '▁None', ':\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁#', '▁Let', '▁Un', 'ig', 'ram', '(', '..', ')', '▁fail', '▁if', '▁only', '▁one', '▁of', '▁them', '▁is', '▁None', '\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁token', 'izer', '▁=', '▁Token', 'izer', '(', 'Un', 'ig', 'ram', '(', 'voc', 'ab', '))\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁else', ':\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁token', 'izer', '▁=', '▁Token', 'izer', '(', 'Un', 'ig', 'ram', '())']
Encoded: [1, 946, 40517, 61803, 15343, 4952, 367, 1163, 12331, 7159, 61776, 9859, 12331, 7159, 2454, 61715, 61715, 61715, 7606, 59192, 61803, 15343, 2426, 367, 1163, 35304, 7159, 962, 61715, 61715, 61715, 4784, 13312, 273, 2426, 367, 1163, 10857, 61750, 437, 273, 1184, 13584, 2854, 1815, 597, 40517, 61803, 15343, 61755, 61715, 61715, 61715, 23853, 61715, 61715, 61715, 1178, 3771, 2999, 1390, 3488, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1349, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 26775, 403, 61799, 21329, 61846, 3412, 61846, 39253, 61846, 2572, 61750, 10030, 17241, 61847, 440, 4673, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 13564, 61799, 1944, 440, 591, 591, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1173, 61768, 17279, 61768, 5529, 61799, 7165, 440, 8620, 622, 61715, 61715, 61715, 39708, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 686, 26775, 403, 366, 618, 4673, 1025, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1478, 4593, 2426, 367, 1163, 61776, 786, 61775, 6272, 686, 1323, 882, 319, 1136, 366, 4673, 61755, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 14205, 7159, 440, 35304, 7159, 61776, 4952, 367, 1163, 61776, 44969, 403, 3630, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 2335, 1025, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 14205, 7159, 440, 35304, 7159, 61776, 4952, 367, 1163, 8170]
Len Tokens 226
Decoded: <s> class SentencePieceUnigramTokenizer(BaseTokenizer):
"""SentencePiece Unigram Tokenizer
Represents the Unigram algorithm, with the pretokenization used by SentencePiece
"""
def __init__(
self,
vocab: Optional[List[Tuple[str, float]]] = None,
replacement: str = " ",
add_prefix_space: bool = True,
):
if vocab is not None:
# Let Unigram(..) fail if only one of them is None
tokenizer = Tokenizer(Unigram(vocab))
else:
tokenizer = Tokenizer(Unigram())
- Emoji
Input : 😜🫤☹️😖🤢🤮😇🐻❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑🦲👨🚒👨🚀
Tokens: ['▁', '😜', '<0xF0>', '<0x9F>', '<0xAB>', '<0xA4>', '☹', '️', '😖', '🤢', '🤮', '😇', '<0xF0>', '<0x9F>', '<0x90>', '<0xBB>', '\u200d', '❄', '️', '🦄', '🐾', '<0xF0>', '<0x9F>', '<0x90>', '<0xBD>', '🐍', '<0xF0>', '<0x9F>', '<0xA6>', '<0x9E>', '<0xF0>', '<0x9F>', '<0xA6>', '<0x90>', '<0xF0>', '<0x9F>', '<0xA6>', '<0xBF>', '<0xF0>', '<0x9F>', '<0xA4>', '<0xB4>', '<0xF0>', '<0x9F>', '<0xA7>', '<0x91>', '\u200d', '<0xF0>', '<0x9F>', '<0xA6>', '<0xB2>', '👨', '\u200d', '<0xF0>', '<0x9F>', '<0x9A>', '<0x92>', '👨', '\u200d', '🚀']
Encoded: [1, 61715, 64694, 243, 162, 174, 167, 66250, 62096, 68719, 68725, 70665, 68209, 243, 162, 147, 190, 62658, 66107, 62096, 70672, 69452, 243, 162, 147, 192, 66921, 243, 162, 169, 161, 243, 162, 169, 147, 243, 162, 169, 194, 243, 162, 167, 183, 243, 162, 170, 148, 62658, 243, 162, 169, 181, 66362, 62658, 243, 162, 157, 149, 66362, 62658, 62748]
Len Tokens 61
Decoded: <s> 😜🫤☹️😖🤢🤮😇🐻❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑🦲👨🚒👨🚀
- Sanskrit
Input : ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.
Tokens: ['▁ॐ', '▁त्र', '्यम', '्ब', 'क', 'ं', '▁य', 'जाम', 'हे', '▁सुग', 'न्', 'धि', 'ं', '▁पुष्ट', 'िव', 'र्', 'धन', 'म्', '▁उर्', 'वार', 'ुक', 'म', 'िव', '▁ब', 'न्', 'धन', 'ान', '्म', 'ृत', '्यो', 'र्म', 'ु', 'क्ष', 'ीय', '▁माम', 'ृता', 'त्', '▁ॐ', '.', '▁']
Encoded: [1, 29916, 5202, 1347, 9931, 61725, 61738, 411, 8036, 11065, 38930, 1063, 3255, 61738, 61196, 816, 326, 4408, 3315, 10236, 527, 1908, 61742, 816, 289, 1063, 4408, 325, 640, 1803, 4364, 911, 61763, 536, 2770, 940, 27338, 592, 29916, 61751, 61715]
Len Tokens 41
Decoded: <s> ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.
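Each example above is a plain encode/decode round trip; the emoji case shows byte fallback at work, where characters outside the 72k vocabulary decompose into raw UTF-8 byte pieces (`<0xF0>`, `<0x9F>`, etc.) and still decode losslessly. A minimal sketch of reproducing these dumps with the raw SentencePiece model (the local filename `tokenizer.model` is an assumption):

```python
import sentencepiece as spm

# Filename is an assumption: point this at the trained SentencePiece model file.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम्"
pieces = sp.encode(text, out_type=str)             # surface pieces, e.g. ['▁ॐ', '▁त्र', ...]
ids = sp.encode(text, out_type=int, add_bos=True)  # ids, with the leading <s> (id 1) as above
print(len(ids), pieces)
print(sp.decode(ids))                              # round-trips back to the input text
```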