# Telugu Tokenizer

A Unigram tokenizer trained specifically for Telugu on a large corpus of text from Telugu Wikipedia and news sources. It is designed to handle Telugu text efficiently while maintaining a high compression ratio.
## Key Features

### Tokenizer Statistics
- Vocabulary Size: 50,000 tokens (✓ exceeds the requirement of 5,000+)
- Compression Ratio: 6.77 (✓ meets the requirement of ≥ 3.0)
- Average Token Length: 6.26 characters
- Training Data: 2,500+ Telugu articles
- Minimum Text Length: 500 characters per article
### Model Configuration
- Architecture: Unigram Language Model
- Max Piece Length: 128
- Sub-iterations: 20
- Initial Vocabulary: 50,000 tokens
- Auto-scaling: Up to 500,000 tokens if needed
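
These settings map onto the `UnigramTrainer` options of the HuggingFace `tokenizers` library. The sketch below is illustrative rather than the exact training script; the special-token list matches the Special Tokens section that follows.

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

# Start from an empty Unigram model; the trainer builds the vocabulary.
tokenizer = Tokenizer(Unigram())

# Trainer settings mirroring the configuration listed above.
trainer = UnigramTrainer(
    vocab_size=50_000,      # initial vocabulary size
    max_piece_length=128,   # longest subword piece considered
    n_sub_iterations=20,    # EM sub-iterations per training round
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
    unk_token="<unk>",
)
```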
### Special Tokens

- `<s>`: Start-of-text token
- `</s>`: End-of-text token
- `<unk>`: Unknown token
- `<pad>`: Padding token
- `<mask>`: Mask token (for potential MLM tasks)
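
The IDs assigned to these tokens can be checked with `token_to_id` once the tokenizer is loaded (the actual IDs depend on the saved vocabulary):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Print the vocabulary ID assigned to each special token.
for token in ["<s>", "</s>", "<unk>", "<pad>", "<mask>"]:
    print(token, "->", tokenizer.token_to_id(token))
```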
## Dataset Details

- Sources:
  - Telugu Wikipedia articles
  - Major Telugu news websites
  - Combined and cleaned text corpus
- Content: Diverse topics including literature, culture, history, and general knowledge
- Preprocessing (a sketch follows this list):
  - Removed references and citations
  - Normalized whitespace
  - Filtered out articles below the minimum length
  - Cleaned special characters
  - Combined short texts for better context
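
The exact patterns and thresholds used for the released corpus are not published, so the regexes and the cutoff below are illustrative only:

```python
import re

MIN_LENGTH = 500  # minimum characters per article, per the statistics above

def clean_article(text: str) -> str:
    # Remove bracketed reference markers such as [1] or [12].
    text = re.sub(r"\[\d+\]", "", text)
    # Drop stray symbols, keeping Telugu script, word characters,
    # whitespace, and basic punctuation.
    text = re.sub(r"[^\u0C00-\u0C7F\w\s.,!?;:'\"()-]", "", text)
    # Collapse all runs of whitespace to single spaces.
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(articles: list[str]) -> list[str]:
    kept, buffer = [], ""
    for article in (clean_article(a) for a in articles):
        if len(article) >= MIN_LENGTH:
            kept.append(article)
        else:
            # Concatenate short texts until they form a usable chunk.
            buffer = (buffer + " " + article).strip()
            if len(buffer) >= MIN_LENGTH:
                kept.append(buffer)
                buffer = ""
    return kept
```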
## Usage

### Installation

```bash
pip install tokenizers
```
### Basic Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Encode text
text = "నమస్కారం"  # "Hello"
encoding = tokenizer.encode(text)

# Get the tokens and their IDs
print("Tokens:", encoding.tokens)
print("Token IDs:", encoding.ids)
```
### Example Outputs

```python
# Input: "తెలుగు భాష చాలా అందమైనది"
# Output tokens: ['తెలుగు', ' భాష', ' చాలా', ' అంద', 'మైన', 'ది']
```
## Technical Details

### Tokenizer Configuration
- Model: Unigram Language Model (SentencePiece-style)
- Pre-tokenization: ByteLevel + Character-level splitting
- Decoder: ByteLevel
- Post-processor: ByteLevel with trimmed offsets
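
As a rough sketch, this corresponds to the following component wiring in the `tokenizers` library (the character-level splitting component is not specified above, so it is omitted here):

```python
from tokenizers import Tokenizer, pre_tokenizers, decoders, processors
from tokenizers.models import Unigram

tokenizer = Tokenizer(Unigram())

# Byte-level pre-tokenization, decoding, and post-processing
# with trimmed offsets, matching the configuration listed above.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
```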
### Performance Metrics

- Compression Ratio: 6.77
  - Calculated as total_chars / total_tokens (see the sketch below)
  - A higher ratio indicates better compression
  - Median ratio: 7.05
- Vocabulary Coverage: 50,000 unique tokens
  - Includes special tokens
  - Optimized for Telugu language patterns
  - Vocabulary size auto-scales for better compression
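
A minimal sketch of how this metric can be computed over a list of evaluation texts (the evaluation set behind the reported numbers is not included here):

```python
import statistics
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

def compression_ratio(text: str) -> float:
    # Characters per token: higher means fewer tokens for the same text.
    return len(text) / len(tokenizer.encode(text).ids)

texts = ["తెలుగు భాష చాలా అందమైనది"]  # replace with an evaluation corpus
ratios = [compression_ratio(t) for t in texts]
print("Mean ratio:  ", sum(ratios) / len(ratios))
print("Median ratio:", statistics.median(ratios))
```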
## Examples

Check `examples.json` for more tokenization examples with different types of Telugu text, including:
- Short phrases
- Complete sentences
- Long paragraphs
- Various writing styles
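
The file's schema is not documented here; assuming each entry carries a `text` field (a hypothetical layout), it could be browsed like this:

```python
import json
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

with open("examples.json", encoding="utf-8") as f:
    examples = json.load(f)

# NOTE: the "text" field is an assumed schema; adjust to the actual layout.
for example in examples:
    print(example["text"], "->", tokenizer.encode(example["text"]).tokens)
```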
## Training Process

The tokenizer was trained using the following steps (a condensed end-to-end sketch follows the list):

1. Collected 2,500+ Telugu articles from multiple sources
2. Cleaned and preprocessed the text
3. Combined short texts to create better context
4. Trained a Unigram model with an initial vocabulary size of 50,000
5. Auto-scaled the vocabulary where needed for better compression
6. Validated the result against the requirements
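
Putting the earlier configuration together, such a run might look like the following; the file path, sample size, and retry loop are illustrative, not the exact script used:

```python
from tokenizers import Tokenizer, pre_tokenizers, decoders, processors
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

SPECIAL_TOKENS = ["<s>", "</s>", "<unk>", "<pad>", "<mask>"]
files = ["telugu_corpus.txt"]  # illustrative path to the cleaned corpus

vocab_size = 50_000
while True:
    tokenizer = Tokenizer(Unigram())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    tokenizer.decoder = decoders.ByteLevel()
    tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

    trainer = UnigramTrainer(
        vocab_size=vocab_size,
        max_piece_length=128,
        n_sub_iterations=20,
        special_tokens=SPECIAL_TOKENS,
        unk_token="<unk>",
    )
    tokenizer.train(files, trainer)

    # Check the compression-ratio requirement (>= 3.0) on a sample;
    # grow the vocabulary up to the 500,000-token ceiling if unmet.
    with open(files[0], encoding="utf-8") as f:
        sample = f.read(100_000)
    ratio = len(sample) / len(tokenizer.encode(sample).ids)
    if ratio >= 3.0 or vocab_size >= 500_000:
        break
    vocab_size *= 2

tokenizer.save("tokenizer.json")
```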