metadata

title: Telugu Tokenizer Demo
emoji: 🔤
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false

Telugu Tokenizer Demo

This is a demo of a custom Telugu tokenizer trained on a large corpus of Telugu text. The tokenizer is designed to efficiently handle Telugu text while maintaining high compression ratios.

Features

Vocabulary Size: 50,000+ tokens
Compression Ratio: >3.0
Special Token Handling: Includes [UNK], [CLS], [SEP], [PAD], [MASK]
Telugu-specific: Optimized for the Telugu character set (Unicode range: U+0C00-U+0C7F)
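
To illustrate the Unicode range listed above, here is a minimal sketch that checks whether characters fall inside the Telugu block. The helper names `is_telugu_char` and `telugu_fraction` are hypothetical and are not part of the tokenizer itself.

```python
# Hypothetical helpers for illustration; not taken from the tokenizer's code.
TELUGU_START = 0x0C00  # first code point of the Telugu block
TELUGU_END = 0x0C7F    # last code point of the Telugu block


def is_telugu_char(ch: str) -> bool:
    """Return True if the single character lies in U+0C00-U+0C7F."""
    return TELUGU_START <= ord(ch) <= TELUGU_END


def telugu_fraction(text: str) -> float:
    """Share of non-whitespace characters that fall in the Telugu block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_telugu_char(c) for c in chars) / len(chars)
```

A check like this could be used to warn when the input is mostly non-Telugu text, where the stated compression ratio is unlikely to hold.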

Usage

Enter Telugu text in the input box
Click "Submit"
View the tokenization results:
- Tokens
- Token IDs
- Number of tokens
- Text length
- Compression ratio
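
The values listed above could be computed roughly as follows. This is a sketch, not the demo's actual code: the function name `tokenization_stats` is hypothetical, and it assumes the compression ratio means characters (Unicode code points) per token, which this README does not state explicitly.

```python
def tokenization_stats(text, tokens, token_ids):
    """Collect the values the demo displays for one input.

    Assumption: compression ratio = text length in code points
    divided by the number of tokens.
    """
    num_tokens = len(tokens)
    text_length = len(text)
    ratio = text_length / num_tokens if num_tokens else 0.0
    return {
        "tokens": tokens,
        "token_ids": token_ids,
        "num_tokens": num_tokens,
        "text_length": text_length,
        "compression_ratio": ratio,
    }


# Example with made-up tokens and placeholder IDs (not real vocabulary entries).
stats = tokenization_stats("నమస్కారం", ["నమ", "##స్కారం"], [101, 102])
print(stats["num_tokens"], stats["text_length"], stats["compression_ratio"])
```

Note that `len(text)` counts code points, so Telugu vowel signs and the virama each count as one character.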

Examples

The demo includes several example texts showcasing different aspects of Telugu text:

Basic greetings
Simple sentences
Complex sentences
Long paragraphs

Tokenizer Source

The tokenizer is available at: https://huggingface.co/Saiteja/telugu-bpe
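
A minimal sketch of loading the tokenizer outside the demo, using `Tokenizer.from_pretrained` from the 🤗 Tokenizers library. It assumes the repository ships a `tokenizer.json` file, which has not been verified here, and the helper names `load_tokenizer` and `encode` are hypothetical.

```python
REPO_ID = "Saiteja/telugu-bpe"  # repository id taken from the URL above


def load_tokenizer(repo_id: str = REPO_ID):
    """Download and load the tokenizer from the Hugging Face Hub.

    Assumption: the repository contains a tokenizer.json file.
    Requires network access and `pip install tokenizers`.
    """
    # Imported lazily so this module can be imported without the dependency.
    from tokenizers import Tokenizer

    return Tokenizer.from_pretrained(repo_id)


def encode(text, tokenizer):
    """Return (tokens, ids) for the given text."""
    encoding = tokenizer.encode(text)
    return encoding.tokens, encoding.ids
```

Typical usage would be `tokens, ids = encode("నమస్కారం", load_tokenizer())`.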

Technical Details

Built using the 🤗 Tokenizers library
Uses WordPiece tokenization with Telugu-specific pre-tokenization rules
Handles Telugu Unicode characters effectively
Maintains high compression ratio while preserving token interpretability
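
For readers unfamiliar with WordPiece, the sketch below shows the usual greedy longest-match-first lookup on a single word. It is a toy illustration rather than this tokenizer's implementation: the vocabulary is made up, the `##` continuation prefix is assumed from the common WordPiece convention, and the Telugu-specific pre-tokenization rules are not reproduced.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # mark non-initial pieces
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only; not the real 50,000+ token vocabulary.
toy_vocab = {"నమ", "##స్", "##కారం", "[UNK]"}
print(wordpiece("నమస్కారం", toy_vocab))
```

Because matching works on code points, a piece can end on a virama or vowel sign, which is why the pieces are not always whole syllables.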