Telugu Tokenizer

A Unigram tokenizer trained specifically for Telugu on a large corpus of text from Telugu Wikipedia and news sources. It is designed to handle Telugu text efficiently while maintaining a high compression ratio.

Key Features

Tokenizer Statistics

  • Vocabulary Size: 50,000 tokens (✓ Exceeds requirement of 5,000+)
  • Compression Ratio: 6.77 (✓ Meets requirement of ≥3.0)
  • Average Token Length: 6.26 characters
  • Training Data: 2,500+ Telugu articles
  • Minimum Text Length: 500 characters per article

Model Configuration

  • Architecture: Unigram Language Model
  • Max Piece Length: 128
  • Sub-iterations: 20
  • Initial Vocabulary: 50,000 tokens
  • Auto-scaling: Up to 500,000 tokens if needed
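
These settings map onto the UnigramTrainer options in the Hugging Face tokenizers library. The sketch below assumes that library was used for training; the corpus file name is a placeholder.

from tokenizers import Tokenizer, models, trainers

# Sketch of a trainer matching the configuration above; "telugu_corpus.txt" is a placeholder.
tokenizer = Tokenizer(models.Unigram())
trainer = trainers.UnigramTrainer(
    vocab_size=50_000,        # initial vocabulary size
    max_piece_length=128,     # max piece length
    n_sub_iterations=20,      # sub-iterations
    unk_token="<unk>",
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)
tokenizer.train(["telugu_corpus.txt"], trainer)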

Special Tokens

  • <s>: Start of text token
  • </s>: End of text token
  • <unk>: Unknown token
  • <pad>: Padding token
  • <mask>: Mask token (for potential MLM tasks)
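
If these tokens are registered in tokenizer.json, their IDs can be looked up directly (a small sketch, assuming the file layout described under Usage):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
# Print the vocabulary ID assigned to each special token.
for token in ["<s>", "</s>", "<unk>", "<pad>", "<mask>"]:
    print(token, "->", tokenizer.token_to_id(token))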

Dataset Details

  • Sources:
    • Telugu Wikipedia articles
    • Major Telugu news websites
    • Combined and cleaned text corpus
  • Content: Diverse topics including literature, culture, history, and general knowledge
  • Preprocessing (a sketch of these steps follows this list):
    • Removed references and citations
    • Normalized whitespace
    • Filtered short articles
    • Cleaned special characters
    • Combined short texts for better context
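
The preprocessing can be approximated with a few lines of Python. This is an illustrative sketch only; the exact cleaning script is not included here, and raw_articles is a placeholder for the scraped texts.

import re

def clean_article(text: str) -> str:
    # Illustrative cleaning: strip citation markers and normalize whitespace.
    text = re.sub(r"\[\d+\]", "", text)          # remove references like [12]
    text = re.sub(r"\s+", " ", text).strip()     # normalize whitespace
    return text

raw_articles = ["..."]                           # placeholder: scraped article texts
cleaned = [clean_article(a) for a in raw_articles]
cleaned = [a for a in cleaned if len(a) >= 500]  # filter articles shorter than 500 characters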

Usage

Installation

pip install tokenizers

Basic Usage

from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Encode text
text = "నమస్కారం"  # Hello
encoding = tokenizer.encode(text)

# Get tokens
print("Tokens:", encoding.tokens)
print("Token IDs:", encoding.ids)

Example Outputs

# Input: "తెలుగు భాష చాలా అందమైనది"
# Output tokens: ['తెలుగు', ' భాష', ' చాలా', ' అంద', 'మైన', 'ది']

Technical Details

Tokenizer Configuration

  • Model: Unigram Language Model (SentencePiece-style)
  • Pre-tokenization: ByteLevel + Character-level splitting
  • Decoder: ByteLevel
  • Post-processor: ByteLevel with trimmed offsets
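
A rough sketch of how this pipeline can be assembled with the tokenizers library (component choices follow the list above; the character-level splitting step is omitted for brevity):

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors

# Wire up the components listed above (sketch, not the exact build script).
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)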

Performance Metrics

  1. Compression Ratio: 6.77
    • Calculated as: total_chars / total_tokens (see the snippet after this list)
    • A higher ratio indicates better compression
    • Median ratio: 7.05
  2. Vocabulary Coverage: 50,000 unique tokens
    • Includes special tokens
    • Optimized for Telugu language patterns
    • Auto-scales vocabulary size for better compression
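
The ratio can be recomputed from the released tokenizer.json. In the snippet below the sample sentences are placeholders; substitute held-out articles for a meaningful measurement.

from statistics import median
from tokenizers import Tokenizer

# Compression ratio = total_chars / total_tokens, per text.
tokenizer = Tokenizer.from_file("tokenizer.json")
samples = ["తెలుగు భాష చాలా అందమైనది", "నమస్కారం"]  # placeholders
ratios = [len(t) / len(tokenizer.encode(t).ids) for t in samples]
print("mean ratio:", sum(ratios) / len(ratios))
print("median ratio:", median(ratios))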

Examples

Check examples.json for more tokenization examples with different types of Telugu text, including:

  • Short phrases
  • Complete sentences
  • Long paragraphs
  • Various writing styles

Training Process

The tokenizer was trained using the following steps:

  1. Collected 2,500+ Telugu articles from multiple sources
  2. Cleaned and preprocessed the text
  3. Combined short texts to create better context
  4. Trained a Unigram model with an initial vocabulary size of 50,000
  5. Auto-scaled the vocabulary when needed for better compression (a sketch of this loop follows the list)
  6. Validated against requirements
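
The auto-scaling in step 5 can be expressed as a simple retraining loop. This is a hypothetical sketch, not the exact script used; train_until_target, its arguments, and the doubling schedule are assumptions.

from tokenizers import Tokenizer, models, trainers

def train_until_target(files, held_out_text, vocab_size=50_000,
                       target_ratio=3.0, max_vocab=500_000):
    # Hypothetical: retrain with a larger vocabulary until the compression
    # target is met or the 500,000-token ceiling is reached.
    while True:
        tokenizer = Tokenizer(models.Unigram())
        trainer = trainers.UnigramTrainer(
            vocab_size=vocab_size,
            unk_token="<unk>",
            special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
        )
        tokenizer.train(files, trainer)
        ratio = len(held_out_text) / len(tokenizer.encode(held_out_text).ids)
        if ratio >= target_ratio or vocab_size >= max_vocab:
            return tokenizer, ratio
        vocab_size = min(vocab_size * 2, max_vocab)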