WE are COOKED

Test Log 08 March 2025

First Test:

Mean perplexity: 1420.74, measured on roughly 2k English samples from wikitext-2-raw-v1.
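
For reference, perplexity is the exponential of the average per-token negative log-likelihood. The sketch below computes it from per-token log-probabilities. The function names are illustrative, and averaging per-sample perplexities (rather than pooling all tokens into one score) is an assumption, since this log does not say how the mean was taken.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of one sample, given natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def mean_perplexity(samples: Sequence[Sequence[float]]) -> float:
    """Arithmetic mean of per-sample perplexities (assumed aggregation)."""
    if not samples:
        raise ValueError("need at least one sample")
    scores = [perplexity(s) for s in samples]
    return sum(scores) / len(scores)


# Hypothetical log-probabilities, not values from the run above.
uniform_4 = [math.log(0.25)] * 10  # model is unsure between 4 options
uniform_2 = [math.log(0.5)] * 10   # model is unsure between 2 options

print(perplexity(uniform_4))
print(mean_perplexity([uniform_4, uniform_2]))
```

A uniform guess over k tokens gives a perplexity of k, so a score near 1420 roughly corresponds to the model being as uncertain as a uniform choice among about 1,400 tokens at each step.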

Second Test:

Evaluated the tokenizer on 1k samples (500 Hindi, 500 English), covering:

  • Unicode coverage.
  • Token distribution.
  • Tokenization complexity across different scripts.
  • Encoding and decoding capabilities.
  • Edge cases, e.g. special characters and numbers.

1. Edge Case Handling

Language  Test Type           Token Count  Unique Tokens
Hindi     Script Test         14           13
Hindi     Unicode Test        21           21
Hindi     Special Characters  19           19
English   Script Test         16           15
English   Unicode Test        14           14
English   Special Characters  18           18

2. Unicode Coverage

Language  Coverage Ratio  Token Count  Unique Tokens
Hindi     100%            21           21
English   100%            14           14
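
This log does not define "coverage ratio". One plausible reading is the share of distinct input characters that survive an encode/decode round trip, and the sketch below measures exactly that. `ByteTokenizer` is a hypothetical stand-in with one token per UTF-8 byte, not the tokenizer shipped with this model.

```python
class ByteTokenizer:
    """Stand-in tokenizer: one token ID per UTF-8 byte (hypothetical)."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        # Invalid byte sequences become U+FFFD instead of raising.
        return bytes(ids).decode("utf-8", errors="replace")


def coverage_ratio(tokenizer, text: str) -> float:
    """Fraction of distinct characters recovered after encode -> decode."""
    chars = set(text)
    if not chars:
        return 1.0
    covered = {c for c in chars if tokenizer.decode(tokenizer.encode(c)) == c}
    return len(covered) / len(chars)


tok = ByteTokenizer()
for sample in ("नमस्ते दुनिया", "Hello, world!"):
    print(f"{coverage_ratio(tok, sample):.0%}")
```

Any tokenizer object with `encode` and `decode` methods can be passed in place of the stand-in. A ratio below 100% would point at characters that are dropped or mapped to an unknown token.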

3. Complexity

Language  Original Length  Token Count  Avg Token Length  Token Diversity
Hindi     49               14           9.07              0.928
English   65               16           4.06              0.937
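
The figures in this table are consistent with two simple ratios: token diversity as unique tokens divided by token count (13/14 and 15/16, using the Script Test counts), and average token length as text length divided by token count. For English, 65/16 matches 4.06. For Hindi, 49/14 would be 3.5, while 9.07 matches 127/14, where 127 is the UTF-8 byte length of a 49-character sentence containing 10 ASCII characters. That suggests token length was measured on byte-level token strings. This is inferred from the numbers, not stated in the log, and the helper name below is illustrative.

```python
def complexity_stats(length_units: int, token_count: int, unique_tokens: int) -> dict:
    """Average token length and diversity from raw counts."""
    return {
        "avg_token_length": length_units / token_count,
        "token_diversity": unique_tokens / token_count,
    }


# English: ASCII text, so characters and bytes coincide.
english = complexity_stats(length_units=65, token_count=16, unique_tokens=15)

# Hindi: assumed byte length of the 49-character sentence.
# 39 Devanagari code points at 3 bytes each, plus 9 spaces and 1 comma.
hindi_bytes = 3 * (49 - 10) + 10
hindi = complexity_stats(length_units=hindi_bytes, token_count=14, unique_tokens=13)

print(hindi_bytes)
print(english)
print(hindi)
```

The diversity values come out as 0.9286 and 0.9375, so the table appears to truncate to three decimals rather than round.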

4. Encoding-Decoding Capabilities


Hindi Analysis:
Original Text: नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।
Token IDs Count: 14
Token Strings: ['नम', 'सà¥įतà¥ĩ', ',', 'Ġमà¥Īà¤Ĥ', 'Ġà¤Ńारत', 'Ġसà¥ĩ', 'Ġहà¥Ĥà¤ģ', '।', 'Ġदिलà¥įलà¥Ģ', 'Ġबहà¥ģत', 'Ġबड़ा', 'Ġशहर', 'Ġहà¥Ī', '।']
Decoded Text: नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।
Text Reconstruction: True
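
The token strings look garbled because byte-level BPE tokenizers in the GPT-2 family print every UTF-8 byte as a printable stand-in character, with 'Ġ' standing for a leading space. The sketch below rebuilds that byte-to-character table as published with GPT-2. It assumes this tokenizer follows the same scheme, which the 'Ġ' prefixes suggest but the log does not confirm. `to_byte_level` and `from_byte_level` are illustrative helper names.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2 style table mapping each byte value to a printable character."""
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in table:
            # Control bytes, space, etc. are moved above U+00FF.
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text: str) -> str:
    """Render text the way a byte-level vocabulary displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token: str) -> str:
    """Invert to_byte_level for a string that forms valid UTF-8."""
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8")


print(to_byte_level(" is"))                     # Ġis
print(from_byte_level(to_byte_level(" है")))    # round-trips to " है"
```

Each Devanagari code point occupies three UTF-8 bytes, so a Hindi token is shown with roughly three stand-in characters per original character. The decoded text is unaffected, which is why reconstruction still succeeds.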

Hindi Analysis:
Original Text: हिंदी भाषा बहुत सुंदर है।
Token IDs Count: 7
Token Strings: ['ह', 'िà¤Ĥदà¥Ģ', 'Ġà¤Ńाषा', 'Ġबहà¥ģत', 'Ġसà¥ģà¤Ĥदर', 'Ġहà¥Ī', '।']
Decoded Text: हिंदी भाषा बहुत सुंदर है।
Text Reconstruction: True

Hindi Analysis:
Original Text: मुझे किताबें पढ़ना पसंद है।
Token IDs Count: 7
Token Strings: ['म', 'à¥ģà¤Ŀà¥ĩ', 'Ġà¤ķिताबà¥ĩà¤Ĥ', 'Ġपढ़ना', 'Ġपसà¤Ĥद', 'Ġहà¥Ī', '।']
Decoded Text: मुझे किताबें पढ़ना पसंद है।
Text Reconstruction: True

Hindi Analysis:
Original Text: यह एक उदाहरण वाक्य है।
Token IDs Count: 6
Token Strings: ['यह', 'Ġà¤ıà¤ķ', 'Ġà¤īदाहरण', 'Ġवाà¤ķà¥įय', 'Ġहà¥Ī', '।']
Decoded Text: यह एक उदाहरण वाक्य है।
Text Reconstruction: True

English Analysis:
Original Text: Hello, I am from India. Delhi is a big city.
Token IDs Count: 13
Token Strings: ['Hello', ',', 'ĠI', 'Ġam', 'Ġfrom', 'ĠIndia', '.', 'ĠDelhi', 'Ġis', 'Ġa', 'Ġbig', 'Ġcity', '.']
Decoded Text: Hello, I am from India. Delhi is a big city.
Text Reconstruction: True

English Analysis:
Original Text: The English language is widely spoken.
Token IDs Count: 7
Token Strings: ['The', 'ĠEnglish', 'Ġlanguage', 'Ġis', 'Ġwidely', 'Ġspoken', '.']
Decoded Text: The English language is widely spoken.
Text Reconstruction: True

English Analysis:
Original Text: I enjoy reading books.
Token IDs Count: 5
Token Strings: ['I', 'Ġenjoy', 'Ġreading', 'Ġbooks', '.']
Decoded Text: I enjoy reading books.
Text Reconstruction: True

English Analysis:
Original Text: This is an example sentence.
Token IDs Count: 6
Token Strings: ['This', 'Ġis', 'Ġan', 'Ġexample', 'Ġsentence', '.']
Decoded Text: This is an example sentence.
Text Reconstruction: True
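
The "Text Reconstruction" flag in the samples above presumably records whether decoding the token IDs returns the original string. A minimal sketch of such a check follows. It takes any pair of encode/decode callables, and the UTF-8 codec used here is a stand-in for the real tokenizer. The report keys are illustrative and only mirror the labels in this log.

```python
from typing import Callable, Sequence


def reconstruction_report(
    text: str,
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
) -> dict:
    """Encode, decode, and compare against the original text."""
    ids = encode(text)
    decoded = decode(ids)
    return {
        "original": text,
        "token_ids_count": len(ids),
        "decoded": decoded,
        "reconstructed": decoded == text,
    }


# Stand-in codec: one ID per UTF-8 byte (not this model's tokenizer).
def utf8_encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))


def utf8_decode(ids: Sequence[int]) -> str:
    return bytes(ids).decode("utf-8")


for sample in ("This is an example sentence.", "यह एक उदाहरण वाक्य है।"):
    report = reconstruction_report(sample, utf8_encode, utf8_decode)
    print(report["token_ids_count"], report["reconstructed"])
```

With a Hugging Face tokenizer, `tokenizer.encode` and `tokenizer.decode` could be passed instead, with special tokens skipped on decode. A strict equality check can fail for tokenizers that apply Unicode normalization, even when the decoded text looks identical.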

Model size: 1.44B params (Safetensors, tensor type BF16)