library_name: transformers
tags:
  - tokenizer
license: apache-2.0

Trained on 20x More Tokens than Previous Iterations

Byte Fallback BPE Tokenizer

  • Trained using google/sentencepiece (SPM)
  • Vocab Size : 72808
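With byte fallback, a character that has no piece in the vocabulary is emitted as one `<0xXX>` piece per UTF-8 byte instead of `<unk>`. The sketch below reproduces that mapping in plain Python for one emoji from the Emoji example in this card. The ID offset of 3 is an assumption inferred from that example (`<0xF0>` appears there as ID 243), not something read from the model file.

```python
# Assumption: byte pieces start right after <unk>, <s>, </s> (IDs 0-2),
# as suggested by the Emoji example in this card.
BYTE_ID_OFFSET = 3


def byte_fallback_pieces(ch: str) -> list[str]:
    """Return the <0xXX> pieces a byte-fallback tokenizer would emit for ch."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


def byte_fallback_ids(ch: str, offset: int = BYTE_ID_OFFSET) -> list[int]:
    """Return the IDs of those byte pieces under the assumed offset."""
    return [b + offset for b in ch.encode("utf-8")]


pieces = byte_fallback_pieces("🫤")
ids = byte_fallback_ids("🫤")
print(pieces)
print(ids)
```

Characters that do have a piece in the vocabulary (for example 😜 in the same example) are not split this way; the sketch only covers the fallback path.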

Training Args



from datasets import load_dataset
import sentencepiece as spm


def get_corpus_iterator():
    # Yield the shuffled corpus in chunks of at most 8192 characters.
    dataset = load_dataset("fhai50032/pds-tk-specific-2", split="train")
    shuffled = dataset.shuffle(seed=42)
    for text in shuffled["text"]:
        stripped = text.strip()
        if stripped:
            for i in range(0, len(stripped), 8192):
                yield stripped[i : i + 8192]


# tokenizer_name, vocab_size (72808 for this tokenizer) and num_threads
# are set elsewhere in the training script.
spm.SentencePieceTrainer.train(
    sentence_iterator=get_corpus_iterator(),
    model_prefix=tokenizer_name,
    vocab_size=vocab_size,
    num_threads=num_threads,
    model_type="bpe",
    max_sentence_length=8192,
    character_coverage=1.0,
    byte_fallback=True,
    shuffle_input_sentence=True,
    remove_extra_whitespaces=False,
    normalization_rule_name="identity",
)
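One detail worth checking when reusing these arguments: the corpus iterator cuts text every 8192 characters, while SentencePiece documents `max_sentence_length` as a length in bytes. For Devanagari text, where most characters take 3 bytes in UTF-8, a full 8192-character chunk is well above 8192 bytes. The sketch below only illustrates that gap; `chunk_by_bytes` is a hypothetical alternative, not the chunking used to train this tokenizer.

```python
def chunk_by_chars(text: str, size: int = 8192) -> list[str]:
    """Mirror of the iterator above: split every `size` characters."""
    stripped = text.strip()
    return [stripped[i : i + size] for i in range(0, len(stripped), size)]


def chunk_by_bytes(text: str, max_bytes: int = 8192) -> list[str]:
    """Hypothetical alternative: keep each chunk within `max_bytes` of UTF-8."""
    chunks, current, current_bytes = [], [], 0
    for ch in text.strip():
        n = len(ch.encode("utf-8"))
        # Start a new chunk before the byte budget would be exceeded.
        if current and current_bytes + n > max_bytes:
            chunks.append("".join(current))
            current, current_bytes = [], 0
        current.append(ch)
        current_bytes += n
    if current:
        chunks.append("".join(current))
    return chunks


sample = "नमस्ते " * 3000  # synthetic Hindi text, for illustration only

char_chunks = chunk_by_chars(sample)
byte_chunks = chunk_by_bytes(sample)

print("largest char-based chunk (bytes):", max(len(c.encode("utf-8")) for c in char_chunks))
print("largest byte-based chunk (bytes):", max(len(c.encode("utf-8")) for c in byte_chunks))
```

Note that a byte-based split can still cut through a grapheme cluster (a consonant and its vowel sign, for instance), just as the character-based split can.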

Special Tokens

{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}

Training Composition:

  • Maths: 550 M, aluncstokes/mathpile_arxiv_subset
  • Code: 800 M, codeparrot/github-code
  • Hinglish: 250 M, Abhishekcr448/Hinglish-Everyday-Conversations-1M and Maihar/hinglish-80k
  • English: 2,000 M, allenai/c4 ("en")
  • Hindi: 2,200 M, aloobun/dhpileIN (data_dir='hi')
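To see how the mix is balanced, the figures above can be turned into shares of the total. This is plain arithmetic on the numbers as listed (in millions); nothing else is assumed about how the counts were measured.

```python
# Counts in millions, copied from the composition list above.
mix_millions = {
    "Maths": 550,
    "Code": 800,
    "Hinglish": 250,
    "English": 2000,
    "Hindi": 2200,
}

total = sum(mix_millions.values())
shares = {name: 100 * count / total for name, count in mix_millions.items()}

# Print the largest sources first.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<9}{mix_millions[name]:>6} M  {pct:5.1f}%")
print(f"{'Total':<9}{total:>6} M")
```

Hindi and English together make up a bit over 70% of the mix, which fits the bilingual focus of the tokenizer.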

Evals

Tokenization Efficiency (Lower is Better)

| # | Tokenizer | English | Hindi | Tamil | Bengali | Malayalam | Telugu | Gujarati | Punjabi | Code_Python | Code_Java | Code_C++ | Math |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | deepseek-ai/DeepSeek-R1 (128k) | 338874 | 22855 | 48957 | 39617 | 73928 | 40345 | 101020 | 79172 | 5231 | 2224 | 7055 | 5376 |
| 1 | unsloth/phi-4 (100k) | 308645 | 40456 | 59750 | 116122 | 149889 | 48689 | 118335 | 87413 | 4809 | 2110 | 6529 | 5573 |
| 2 | deepseek-ai/DeepSeek-R1-Distill-Llama-8B (128k) | 308512 | 21110 | 59625 | 115138 | 149883 | 48661 | 118061 | 86765 | 4809 | 2111 | 6530 | 5574 |
| 3 | unsloth/gemma-2-9b-it (256k) | 323335 | 15916 | 53913 | 53402 | 57219 | 47610 | 107925 | 87222 | 5948 | 2569 | 8639 | 5871 |
| 4 | Ornaments/72k-Bilingual-BBPE-TK-SPM (72k) (Old) | 366710 | 11447 | 61408 | 94191 | 97207 | 50229 | 117874 | 90045 | 8201 | 4000 | 13706 | 5585 |
| > | Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity (72k) | 330830 | 10318 | 59089 | 93740 | 92655 | 44975 | 109411 | 87922 | 7819 | 3743 | 12953 | 5253 |
| 6 | Ornaments/72k-TK-BBPE-HF (72k) | 321274 | 10813 | 67585 | 159985 | 193813 | 55654 | 134397 | 97063 | 5225 | 2263 | 7090 | 5150 |
| 7 | nvidia/Nemotron-4-Mini-Hindi-4B-Instruct (256k) | 332271 | 14327 | 55473 | 36615 | 45783 | 48270 | 160115 | 117174 | 6186 | 2732 | 8861 | 6136 |
| 8 | sarvamai/OpenHathi-7B-Hi-v0.1-Base (48k) | 370133 | 15633 | 67845 | 120340 | 105953 | 68315 | 159122 | 113817 | 6595 | 2792 | 9233 | 6223 |
| 9 | sarvamai/sarvam-1 (68k) | 385386 | 11257 | 61396 | 27348 | 31822 | 51463 | 119666 | 103344 | 7331 | 3068 | 9724 | 6864 |
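Raw counts are easier to compare as ratios against one baseline. The sketch below does this for the English and Hindi columns, using the row marked `>` as the baseline. The numbers are copied from the table above; the helper name `relative_to` is only for illustration.

```python
# English and Hindi counts copied from the table above (lower is better).
counts = {
    "deepseek-ai/DeepSeek-R1": {"English": 338874, "Hindi": 22855},
    "unsloth/phi-4": {"English": 308645, "Hindi": 40456},
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B": {"English": 308512, "Hindi": 21110},
    "unsloth/gemma-2-9b-it": {"English": 323335, "Hindi": 15916},
    "Ornaments/72k-Bilingual-BBPE-TK-SPM": {"English": 366710, "Hindi": 11447},
    "Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity": {"English": 330830, "Hindi": 10318},
    "Ornaments/72k-TK-BBPE-HF": {"English": 321274, "Hindi": 10813},
    "nvidia/Nemotron-4-Mini-Hindi-4B-Instruct": {"English": 332271, "Hindi": 14327},
    "sarvamai/OpenHathi-7B-Hi-v0.1-Base": {"English": 370133, "Hindi": 15633},
    "sarvamai/sarvam-1": {"English": 385386, "Hindi": 11257},
}

BASELINE = "Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity"


def relative_to(baseline: str, column: str) -> dict[str, float]:
    """Return each tokenizer's count divided by the baseline's count."""
    base = counts[baseline][column]
    return {name: row[column] / base for name, row in counts.items()}


best = {}
for column in ("English", "Hindi"):
    ratios = relative_to(BASELINE, column)
    best[column] = min(counts, key=lambda name: counts[name][column])
    print(f"--- {column} (1.00 = {BASELINE}) ---")
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{ratio:5.2f}  {name}")
```

A ratio above 1.00 means that tokenizer needed more tokens than the baseline for the same text in that column.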

Encode-Decode

  • Hindi
Input  : ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
Tokens: ['▁ऋतुराज', '▁गायकवाड़', '▁(', 'क', 'प्तान', '),', '▁डे', 'व', 'ोन', '▁कॉ', 'न', 'वे', ',', '▁रच', 'िन', '▁रविंद्र', ',', '▁राहुल', '▁त्रिपाठी', ',', '▁शिवम', '▁दुबे', ',', '▁रविंद्र', '▁जडेजा', ',', '▁एमएस', '▁धोनी', '▁(', 'व', 'िकेट', 'कीपर', '),', '▁आर', '▁अश्विन', ',', '▁मी', 'था', 'शा', '▁पथ', 'िर', 'ाना', ',', '▁ख', 'लील', '▁अहमद', ',', '▁नूर', '▁अहमद', '।']
Encoded: [1, 34862, 26967, 435, 61725, 29148, 1099, 1945, 61754, 1769, 1777, 61735, 981, 61750, 18114, 465, 19049, 61750, 4310, 11042, 61750, 21132, 13133, 61750, 19049, 13624, 61750, 18436, 12473, 435, 61754, 1956, 14572, 1099, 2208, 17618, 61750, 3352, 2063, 2500, 12020, 731, 781, 61750, 429, 13121, 9490, 61750, 26786, 9490, 61770]
Len Tokens 51
Decoded: <s> ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
  • English
Input  : Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
Tokens: ['▁Bangalore', '▁and', '▁Chennai', '▁have', '▁faced', '▁each', '▁other', '▁in', '▁33', '▁matches', '▁in', '▁IPL', '.', '▁Out', '▁of', '▁these', '▁33', '▁games', ',', '▁Bangalore', '▁have', '▁won', '▁11', '▁whereas', '▁Chennai', '▁have', '▁come', '▁out', '▁vict', 'orious', '▁on', '▁21', '▁occasion', '.', '▁1', '▁match', '▁ended', '▁without', '▁a', '▁result', '.']
Encoded: [1, 43579, 317, 42140, 607, 21626, 1872, 1022, 313, 7736, 14838, 313, 9863, 61751, 7363, 319, 1517, 7736, 4837, 61750, 43579, 607, 4817, 1730, 22734, 42140, 607, 2968, 811, 9594, 30189, 395, 3209, 13423, 61751, 385, 5083, 13623, 2675, 262, 1773, 61751]
Len Tokens 42
Decoded: <s> Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
  • Math
Input  : % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi

Tokens: ['▁%', '▁Change', '▁the', '▁font', '▁if', '▁you', '▁want', '▁to', ',', '▁depending', '▁on', '▁whether', '\n', '%', '▁you', "'", 're', '▁using', '▁pd', 'fl', 'ate', 'x', '▁or', '▁x', 'el', 'ate', 'x', '/', 'l', 'ual', 'ate', 'x', '\n', '%', '▁WH', 'EN', '▁COMP', 'IL', 'ING', '▁WITH', '▁X', 'EL', 'ATE', 'X', '▁PLEASE', '▁USE', '\n', '%', '▁x', 'el', 'ate', 'x', '▁-', 'shell', '-', 'escape', '▁-', 'output', '-', 'driver', '="', 'xd', 'v', 'ip', 'df', 'mx', '▁-', 'z', '▁0', '"', '▁sample', '.', 'tex', '\n\\', 'ift', 'ut', 'ex', '\n', '▁', '▁%', '▁If', '▁using', '▁x', 'el', 'ate', 'x', '▁or', '▁l', 'ual', 'ate', 'x', ':\n', '▁', '▁\\', 'set', 'main', 'font', '{', 'Rob', 'oto', '▁Sl', 'ab', '}\n', '▁', '▁\\', 'sets', 'ans', 'font', '{', 'L', 'ato', '}\n', '▁', '▁\\', 'renew', 'command', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}\n\\', 'else', '\n', '▁', '▁%', '▁If', '▁using', '▁pd', 'fl', 'ate', 'x', ':\n', '▁', '▁\\', 'us', 'ep', 'ack', 'age', '[', 'rm', ']{', 'rob', 'oto', '}\n', '▁', '▁\\', 'us', 'ep', 'ack', 'age', '[', 'defaults', 'ans', ']{', 'l', 'ato', '}\n', '▁', '▁%', '▁\\', 'us', 'ep', 'ack', 'age', '{', 'sources', 'ans', 'pro', '}\n', '▁', '▁\\', 'renew', 'command', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}\n\\', 'fi', '\n']
Encoded: [1, 2920, 20717, 273, 9731, 686, 380, 1570, 306, 61750, 11224, 395, 4910, 61755, 61863, 380, 61809, 265, 2138, 32887, 2673, 442, 61792, 506, 1894, 335, 442, 61792, 61804, 61729, 1204, 442, 61792, 61755, 61863, 19877, 1920, 27670, 4809, 3922, 25404, 2470, 8534, 6586, 61859, 61046, 22326, 61755, 61863, 1894, 335, 442, 61792, 777, 34320, 61780, 35727, 777, 9020, 61780, 16819, 696, 25014, 61762, 947, 7497, 35801, 777, 61831, 612, 61798, 10079, 61751, 8032, 1207, 2865, 388, 1096, 61755, 61715, 2920, 1608, 2138, 1894, 335, 442, 61792, 506, 334, 1204, 442, 61792, 1025, 61715, 426, 1106, 6972, 5295, 61782, 32606, 5896, 11751, 403, 499, 61715, 426, 8105, 770, 5295, 61782, 61811, 10464, 499, 61715, 426, 48033, 10843, 557, 16861, 6694, 4661, 5157, 6694, 1495, 7019, 61755, 61715, 2920, 1608, 2138, 32887, 2673, 442, 61792, 1025, 61715, 426, 379, 953, 697, 626, 61846, 1628, 6219, 35451, 5896, 499, 61715, 426, 379, 953, 697, 626, 61846, 41970, 770, 6219, 61729, 10464, 499, 61715, 2920, 426, 379, 953, 697, 626, 61782, 43733, 770, 1194, 499, 61715, 426, 48033, 10843, 557, 16861, 6694, 4661, 5157, 6694, 1495, 12885, 61755]
Len Tokens 185
Decoded: <s> % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi
  • Code
Input  : class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
Tokens: ['▁class', '▁Sentence', 'P', 'iece', 'Un', 'ig', 'ram', 'Token', 'izer', '(', 'Base', 'Token', 'izer', '):\n', '▁', '▁', '▁', '▁"""', 'Sentence', 'P', 'iece', '▁Un', 'ig', 'ram', '▁Token', 'izer', '\n\n', '▁', '▁', '▁', '▁Rep', 'resents', '▁the', '▁Un', 'ig', 'ram', '▁algorithm', ',', '▁with', '▁the', '▁pre', 'token', 'ization', '▁used', '▁by', '▁Sentence', 'P', 'iece', '\n', '▁', '▁', '▁', '▁"""\n\n', '▁', '▁', '▁', '▁def', '▁__', 'init', '__', '(\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁self', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁voc', 'ab', ':', '▁Optional', '[', 'List', '[', 'Tuple', '[', 'str', ',', '▁float', ']]', ']', '▁=', '▁None', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁replacement', ':', '▁str', '▁=', '▁"', '▁"', ',\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁add', '_', 'prefix', '_', 'space', ':', '▁bool', '▁=', '▁True', ',\n', '▁', '▁', '▁', '▁):\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁if', '▁voc', 'ab', '▁is', '▁not', '▁None', ':\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁#', '▁Let', '▁Un', 'ig', 'ram', '(', '..', ')', '▁fail', '▁if', '▁only', '▁one', '▁of', '▁them', '▁is', '▁None', '\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁token', 'izer', '▁=', '▁Token', 'izer', '(', 'Un', 'ig', 'ram', '(', 'voc', 'ab', '))\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁else', ':\n', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁', '▁token', 'izer', '▁=', '▁Token', 'izer', '(', 'Un', 'ig', 'ram', '())']
Encoded: [1, 946, 40517, 61803, 15343, 4952, 367, 1163, 12331, 7159, 61776, 9859, 12331, 7159, 2454, 61715, 61715, 61715, 7606, 59192, 61803, 15343, 2426, 367, 1163, 35304, 7159, 962, 61715, 61715, 61715, 4784, 13312, 273, 2426, 367, 1163, 10857, 61750, 437, 273, 1184, 13584, 2854, 1815, 597, 40517, 61803, 15343, 61755, 61715, 61715, 61715, 23853, 61715, 61715, 61715, 1178, 3771, 2999, 1390, 3488, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1349, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 26775, 403, 61799, 21329, 61846, 3412, 61846, 39253, 61846, 2572, 61750, 10030, 17241, 61847, 440, 4673, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 13564, 61799, 1944, 440, 591, 591, 622, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1173, 61768, 17279, 61768, 5529, 61799, 7165, 440, 8620, 622, 61715, 61715, 61715, 39708, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 686, 26775, 403, 366, 618, 4673, 1025, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 1478, 4593, 2426, 367, 1163, 61776, 786, 61775, 6272, 686, 1323, 882, 319, 1136, 366, 4673, 61755, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 14205, 7159, 440, 35304, 7159, 61776, 4952, 367, 1163, 61776, 44969, 403, 3630, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 2335, 1025, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 61715, 14205, 7159, 440, 35304, 7159, 61776, 4952, 367, 1163, 8170]
Len Tokens 226
Decoded: <s> class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = " ",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
  • Emoji
Input  : 😜🫤☹️😖🤢🤮😇🐻‍❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑‍🦲👨‍🚒👨‍🚀
Tokens: ['▁', '😜', '<0xF0>', '<0x9F>', '<0xAB>', '<0xA4>', '☹', '️', '😖', '🤢', '🤮', '😇', '<0xF0>', '<0x9F>', '<0x90>', '<0xBB>', '\u200d', '❄', '️', '🦄', '🐾', '<0xF0>', '<0x9F>', '<0x90>', '<0xBD>', '🐍', '<0xF0>', '<0x9F>', '<0xA6>', '<0x9E>', '<0xF0>', '<0x9F>', '<0xA6>', '<0x90>', '<0xF0>', '<0x9F>', '<0xA6>', '<0xBF>', '<0xF0>', '<0x9F>', '<0xA4>', '<0xB4>', '<0xF0>', '<0x9F>', '<0xA7>', '<0x91>', '\u200d', '<0xF0>', '<0x9F>', '<0xA6>', '<0xB2>', '👨', '\u200d', '<0xF0>', '<0x9F>', '<0x9A>', '<0x92>', '👨', '\u200d', '🚀']
Encoded: [1, 61715, 64694, 243, 162, 174, 167, 66250, 62096, 68719, 68725, 70665, 68209, 243, 162, 147, 190, 62658, 66107, 62096, 70672, 69452, 243, 162, 147, 192, 66921, 243, 162, 169, 161, 243, 162, 169, 147, 243, 162, 169, 194, 243, 162, 167, 183, 243, 162, 170, 148, 62658, 243, 162, 169, 181, 66362, 62658, 243, 162, 157, 149, 66362, 62658, 62748]
Len Tokens 61
Decoded: <s> 😜🫤☹️😖🤢🤮😇🐻‍❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑‍🦲👨‍🚒👨‍🚀
  • Sanskrit
Input  : ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ. 
Tokens: ['▁ॐ', '▁त्र', '्यम', '्ब', 'क', 'ं', '▁य', 'जाम', 'हे', '▁सुग', 'न्', 'धि', 'ं', '▁पुष्ट', 'िव', 'र्', 'धन', 'म्', '▁उर्', 'वार', 'ुक', 'म', 'िव', '▁ब', 'न्', 'धन', 'ान', '्म', 'ृत', '्यो', 'र्म', 'ु', 'क्ष', 'ीय', '▁माम', 'ृता', 'त्', '▁ॐ', '.', '▁']
Encoded: [1, 29916, 5202, 1347, 9931, 61725, 61738, 411, 8036, 11065, 38930, 1063, 3255, 61738, 61196, 816, 326, 4408, 3315, 10236, 527, 1908, 61742, 816, 289, 1063, 4408, 325, 640, 1803, 4364, 911, 61763, 536, 2770, 940, 27338, 592, 29916, 61751, 61715]
Len Tokens 41
Decoded: <s> ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.
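The `Tokens` lists above can be turned back into text by hand, which helps when reading them. The sketch below is a simplified approximation of SentencePiece detokenization, not the library's implementation: it maps `▁` to a space, reassembles `<0xXX>` byte pieces into UTF-8, and drops the single leading space that comes from the dummy prefix. Special tokens such as `<s>` are not handled.

```python
import re

# Matches byte-fallback pieces such as <0xF0>.
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def pieces_to_text(pieces: list[str]) -> str:
    """Approximate detokenization of SentencePiece pieces (simplified sketch)."""
    out = bytearray()
    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match:
            # Byte piece: append the raw byte it stands for.
            out.append(int(match.group(1), 16))
        else:
            # Regular piece: the "▁" marker (U+2581) stands for a space.
            out.extend(piece.replace("\u2581", " ").encode("utf-8"))
    text = out.decode("utf-8", errors="replace")
    # Drop the leading space introduced by the dummy prefix, if present.
    return text[1:] if text.startswith(" ") else text


english = pieces_to_text(["▁vict", "orious", "▁on", "▁21"])
emoji = pieces_to_text(["▁", "😜", "<0xF0>", "<0x9F>", "<0xAB>", "<0xA4>"])
print(english)
print(emoji)
```

The two calls use pieces taken from the English and Emoji examples above.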