title: Telugu Tokenizer Demo
emoji: 🔤
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
Telugu Tokenizer Demo
This is a demo of a custom Telugu tokenizer trained on a large corpus of Telugu text. The tokenizer is designed to handle Telugu text efficiently while maintaining a high compression ratio.
Features
- Vocabulary Size: 50,000+ tokens
- Compression Ratio: >3.0
- Special Token Handling: Includes [UNK], [CLS], [SEP], [PAD], [MASK]
- Telugu-specific: Optimized for Telugu character set (Unicode range: \u0C00-\u0C7F)
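The Telugu Unicode block named in the feature list can be checked directly by code point. The following is a minimal sketch using only the Python standard library; the helper names `is_telugu` and `telugu_runs` are hypothetical and are not taken from the demo's code.

```python
import re

# Telugu Unicode block, as stated in the feature list: U+0C00 to U+0C7F.
TELUGU_RUN = re.compile(r"[\u0C00-\u0C7F]+")


def is_telugu(ch: str) -> bool:
    """Return True if a single character falls inside the Telugu block."""
    return 0x0C00 <= ord(ch) <= 0x0C7F


def telugu_runs(text: str) -> list[str]:
    """Return each maximal run of consecutive Telugu characters in text."""
    return TELUGU_RUN.findall(text)


print(is_telugu("త"))  # True
print(is_telugu("a"))  # False
print(telugu_runs("నమస్కారం world తెలుగు"))
```

A check like this is one plausible building block for Telugu-aware pre-tokenization, since it separates Telugu spans from Latin text, digits, and punctuation before any subword splitting happens.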
Usage
- Enter Telugu text in the input box
- Click "Submit"
- View the tokenization results:
  - Tokens
  - Token IDs
  - Number of tokens
  - Text length
  - Compression ratio
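The reported statistics can be derived from the input text and its token list. The sketch below assumes the compression ratio is the text length in Unicode code points divided by the number of tokens; this README does not define the ratio, so that definition and the helper name `tokenization_stats` are assumptions.

```python
def tokenization_stats(text: str, tokens: list[str]) -> dict:
    """Summarize a tokenization result.

    Assumption: compression ratio = text length in Unicode code points
    divided by the number of tokens. The README does not define the ratio.
    """
    num_tokens = len(tokens)
    text_length = len(text)
    # Guard against division by zero for empty input.
    ratio = text_length / num_tokens if num_tokens else 0.0
    return {
        "num_tokens": num_tokens,
        "text_length": text_length,
        "compression_ratio": ratio,
    }


# Hypothetical split of a six-code-point word into two tokens.
stats = tokenization_stats("తెలుగు", ["తెలు", "##గు"])
print(stats)
```

Under this definition, a ratio above 3.0 would mean that each token covers more than three code points on average.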
Examples
The demo includes several example texts showcasing different aspects of Telugu text:
- Basic greetings
- Simple sentences
- Complex sentences
- Long paragraphs
Tokenizer Source
The tokenizer is available at: https://huggingface.co/Saiteja/telugu-bpe
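One way to load the published tokenizer is through `Tokenizer.from_pretrained` from the 🤗 Tokenizers library. This is a sketch only: it assumes the repository contains a `tokenizer.json` file, which has not been verified here, and the wrapper name `load_telugu_tokenizer` is hypothetical.

```python
REPO_ID = "Saiteja/telugu-bpe"  # repository name taken from the tokenizer URL


def load_telugu_tokenizer(repo_id: str = REPO_ID):
    """Download and load the tokenizer from the Hugging Face Hub.

    Assumption: the repository ships a tokenizer.json file, which is what
    Tokenizer.from_pretrained looks for. Requires network access.
    """
    # Imported lazily so that defining this helper needs no extra packages.
    from tokenizers import Tokenizer  # pip install tokenizers

    return Tokenizer.from_pretrained(repo_id)
```

If the load succeeds, `load_telugu_tokenizer().encode("నమస్కారం")` returns an `Encoding` object whose `tokens` and `ids` attributes correspond to the "Tokens" and "Token IDs" outputs of the demo.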
Technical Details
- Built using the 🤗 Tokenizers library
- Uses WordPiece tokenization with Telugu-specific pre-tokenization rules
- Handles Telugu Unicode characters effectively
- Maintains high compression ratio while preserving token interpretability
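For readers unfamiliar with WordPiece, the sketch below shows the generic greedy longest-match-first splitting step on a single word, using `##` to mark continuation pieces. It is not the demo's implementation: the toy vocabulary is invented for illustration, and the tokenizer's real vocabulary and pre-tokenization rules are not reproduced here.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first WordPiece split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no valid split: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, invented for illustration only.
toy_vocab = {"తెలు", "##గు", "తె", "##లు"}
print(wordpiece_tokenize("తెలుగు", toy_vocab))
print(wordpiece_tokenize("abc", toy_vocab))
```

Because the longest matching piece is always taken first, a larger vocabulary tends to produce fewer tokens per word, which is what drives the compression ratio up. Note that this sketch slices by code point, so a piece boundary can fall between a consonant and its vowel sign.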