# Telugu Tokenizer

A Unigram tokenizer trained specifically for Telugu on a large corpus of text from Telugu Wikipedia and news sources. It is designed to handle Telugu text efficiently while maintaining a high compression ratio.
## Key Features

### Tokenizer Statistics
- Vocabulary Size: 50,000 tokens (✓ exceeds the requirement of 5,000+)
- Compression Ratio: 6.77 (✓ meets the requirement of ≥ 3.0)
- Average Token Length: 6.26 characters
- Training Data: 2,500+ Telugu articles
- Minimum Text Length: 500 characters per article
### Model Configuration
- Architecture: Unigram Language Model
- Max Piece Length: 128
- Sub-iterations: 20
- Initial Vocabulary: 50,000 tokens
- Auto-scaling: Up to 500,000 tokens if needed
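
These settings map onto the `UnigramTrainer` options of the HuggingFace `tokenizers` library. The sketch below is illustrative rather than the exact training script; the special-token list matches the Special Tokens section that follows.

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

# Start from an empty Unigram model; the trainer builds the vocabulary.
tokenizer = Tokenizer(Unigram())

# Trainer settings mirroring the configuration listed above.
trainer = UnigramTrainer(
    vocab_size=50_000,      # initial vocabulary size
    max_piece_length=128,   # longest subword piece considered
    n_sub_iterations=20,    # EM sub-iterations per training round
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
    unk_token="<unk>",
)
```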
### Special Tokens

- `<s>`: Start-of-text token
- `</s>`: End-of-text token
- `<unk>`: Unknown token
- `<pad>`: Padding token
- `<mask>`: Mask token (for potential MLM tasks)
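
The IDs assigned to these tokens can be checked with `token_to_id` once the tokenizer is loaded (the actual IDs depend on the saved vocabulary):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Print the vocabulary ID assigned to each special token.
for token in ["<s>", "</s>", "<unk>", "<pad>", "<mask>"]:
    print(token, "->", tokenizer.token_to_id(token))
```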
## Dataset Details

- Sources:
  - Telugu Wikipedia articles
  - Major Telugu news websites
  - Combined and cleaned text corpus
- Content: Diverse topics including literature, culture, history, and general knowledge
- Preprocessing (a sketch follows this list):
  - Removed references and citations
  - Normalized whitespace
  - Filtered out articles below the minimum length
  - Cleaned special characters
  - Combined short texts for better context
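
The exact patterns and thresholds used for the released corpus are not published, so the regexes and the cutoff below are illustrative only:

```python
import re

MIN_LENGTH = 500  # minimum characters per article, per the statistics above

def clean_article(text: str) -> str:
    # Remove bracketed reference markers such as [1] or [12].
    text = re.sub(r"\[\d+\]", "", text)
    # Drop stray symbols, keeping Telugu script, word characters,
    # whitespace, and basic punctuation.
    text = re.sub(r"[^\u0C00-\u0C7F\w\s.,!?;:'\"()-]", "", text)
    # Collapse all runs of whitespace to single spaces.
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(articles: list[str]) -> list[str]:
    kept, buffer = [], ""
    for article in (clean_article(a) for a in articles):
        if len(article) >= MIN_LENGTH:
            kept.append(article)
        else:
            # Concatenate short texts until they form a usable chunk.
            buffer = (buffer + " " + article).strip()
            if len(buffer) >= MIN_LENGTH:
                kept.append(buffer)
                buffer = ""
    return kept
```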
## Usage

### Installation

```bash
pip install tokenizers
```
### Basic Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Encode text
text = "నమస్కారం"  # "Hello"
encoding = tokenizer.encode(text)

# Get the tokens and their IDs
print("Tokens:", encoding.tokens)
print("Token IDs:", encoding.ids)
```
### Example Outputs

```python
# Input: "తెలుగు భాష చాలా అందమైనది"
# Output tokens: ['తెలుగు', ' భాష', ' చాలా', ' అంద', 'మైన', 'ది']
```
## Technical Details

### Tokenizer Configuration
- Model: Unigram Language Model (SentencePiece-style)
- Pre-tokenization: ByteLevel + Character-level splitting
- Decoder: ByteLevel
- Post-processor: ByteLevel with trimmed offsets
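
As a rough sketch, this corresponds to the following component wiring in the `tokenizers` library (the character-level splitting component is not specified above, so it is omitted here):

```python
from tokenizers import Tokenizer, pre_tokenizers, decoders, processors
from tokenizers.models import Unigram

tokenizer = Tokenizer(Unigram())

# Byte-level pre-tokenization, decoding, and post-processing
# with trimmed offsets, matching the configuration listed above.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
```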
### Performance Metrics

- Compression Ratio: 6.77
  - Calculated as total_chars / total_tokens (see the sketch below)
  - A higher ratio indicates better compression
  - Median ratio: 7.05
- Vocabulary Coverage: 50,000 unique tokens
  - Includes special tokens
  - Optimized for Telugu language patterns
  - Vocabulary size auto-scales for better compression
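
A minimal sketch of how this metric can be computed over a list of evaluation texts (the evaluation set behind the reported numbers is not included here):

```python
import statistics
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

def compression_ratio(text: str) -> float:
    # Characters per token: higher means fewer tokens for the same text.
    return len(text) / len(tokenizer.encode(text).ids)

texts = ["తెలుగు భాష చాలా అందమైనది"]  # replace with an evaluation corpus
ratios = [compression_ratio(t) for t in texts]
print("Mean ratio:  ", sum(ratios) / len(ratios))
print("Median ratio:", statistics.median(ratios))
```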
## Examples

Check `examples.json` for more tokenization examples with different types of Telugu text, including:
- Short phrases
- Complete sentences
- Long paragraphs
- Various writing styles
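
The file's schema is not documented here; assuming each entry carries a `text` field (a hypothetical layout), it could be browsed like this:

```python
import json
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

with open("examples.json", encoding="utf-8") as f:
    examples = json.load(f)

# NOTE: the "text" field is an assumed schema; adjust to the actual layout.
for example in examples:
    print(example["text"], "->", tokenizer.encode(example["text"]).tokens)
```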
## Training Process

The tokenizer was trained using the following steps (a condensed end-to-end sketch follows the list):

1. Collected 2,500+ Telugu articles from multiple sources
2. Cleaned and preprocessed the text
3. Combined short texts to create better context
4. Trained a Unigram model with an initial vocabulary size of 50,000
5. Auto-scaled the vocabulary where needed for better compression
6. Validated the result against the requirements
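
Putting the earlier configuration together, such a run might look like the following; the file path, sample size, and retry loop are illustrative, not the exact script used:

```python
from tokenizers import Tokenizer, pre_tokenizers, decoders, processors
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

SPECIAL_TOKENS = ["<s>", "</s>", "<unk>", "<pad>", "<mask>"]
files = ["telugu_corpus.txt"]  # illustrative path to the cleaned corpus

vocab_size = 50_000
while True:
    tokenizer = Tokenizer(Unigram())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    tokenizer.decoder = decoders.ByteLevel()
    tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

    trainer = UnigramTrainer(
        vocab_size=vocab_size,
        max_piece_length=128,
        n_sub_iterations=20,
        special_tokens=SPECIAL_TOKENS,
        unk_token="<unk>",
    )
    tokenizer.train(files, trainer)

    # Check the compression-ratio requirement (>= 3.0) on a sample;
    # grow the vocabulary up to the 500,000-token ceiling if unmet.
    with open(files[0], encoding="utf-8") as f:
        sample = f.read(100_000)
    ratio = len(sample) / len(tokenizer.encode(sample).ids)
    if ratio >= 3.0 or vocab_size >= 500_000:
        break
    vocab_size *= 2

tokenizer.save("tokenizer.json")
```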