---
title: Telugu Tokenizer Demo
emoji: 🔤
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
---

# Telugu Tokenizer Demo

This Space demonstrates a custom Telugu tokenizer trained on a large corpus of Telugu text. The tokenizer is designed to encode Telugu text efficiently, producing few tokens per input while keeping the tokens readable.

## Features

- **Vocabulary Size**: 50,000+ tokens
- **Compression Ratio**: >3.0
- **Special Token Handling**: Includes [UNK], [CLS], [SEP], [PAD], [MASK]
- **Telugu-specific**: Optimized for the Telugu character set (Unicode block U+0C00 to U+0C7F)
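
To make the Unicode range above concrete, here is a minimal sketch of how one might check whether text falls inside the Telugu block. The helper names `is_telugu_char` and `telugu_fraction` are hypothetical and are not part of the tokenizer or the demo app.

```python
# Bounds of the Telugu Unicode block, as listed in the Features section.
TELUGU_START = 0x0C00
TELUGU_END = 0x0C7F


def is_telugu_char(ch: str) -> bool:
    """Return True if a single character lies in the Telugu block."""
    return TELUGU_START <= ord(ch) <= TELUGU_END


def telugu_fraction(text: str) -> float:
    """Fraction of non-whitespace code points that are Telugu (0.0 for empty input)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_telugu_char(c) for c in chars) / len(chars)


if __name__ == "__main__":
    print(telugu_fraction("తెలుగు text"))
```

A check like this can be useful for filtering mixed-script input before tokenization, since digits, Latin letters, and punctuation fall outside the block.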

## Usage

1. Enter Telugu text in the input box
2. Click "Submit"
3. View the tokenization results:
   - Tokens
   - Token IDs
   - Number of tokens
   - Text length
   - Compression ratio
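
The reported statistics can be sketched as follows. This assumes the compression ratio is the text length in Unicode code points divided by the number of tokens; the demo may define it differently (for example, in bytes). The function name `tokenization_stats` and the sample tokens and IDs are hypothetical.

```python
def tokenization_stats(text: str, tokens: list, ids: list) -> dict:
    """Summarize a tokenization result.

    Assumption: compression ratio = characters (code points) per token.
    """
    num_tokens = len(tokens)
    text_length = len(text)  # counts code points, so Telugu vowel signs count separately
    ratio = text_length / num_tokens if num_tokens else 0.0
    return {
        "tokens": tokens,
        "token_ids": ids,
        "num_tokens": num_tokens,
        "text_length": text_length,
        "compression_ratio": ratio,
    }


if __name__ == "__main__":
    # Made-up tokens and IDs, for illustration only.
    stats = tokenization_stats("తెలుగు", ["తెలు", "##గు"], [101, 102])
    print(stats["compression_ratio"])
```

Under this definition, a ratio above 3.0 means each token covers more than three code points on average.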

## Examples

The demo includes several example texts showcasing different aspects of Telugu text:
- Basic greetings
- Simple sentences
- Complex sentences
- Long paragraphs

## Tokenizer Source

The tokenizer is available at: [https://huggingface.co/Saiteja/telugu-bpe](https://huggingface.co/Saiteja/telugu-bpe)

## Technical Details

- Built using the 🤗 Tokenizers library
- Uses WordPiece tokenization with Telugu-specific pre-tokenization rules
- Handles Telugu Unicode characters effectively
- Maintains high compression ratio while preserving token interpretability
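
For readers unfamiliar with WordPiece, the following is a minimal sketch of the greedy longest-match-first procedure commonly used to split a word at inference time. It is not the tokenizer's actual implementation: the toy vocabulary is hypothetical, and the `##` continuation prefix is an assumption based on the common WordPiece convention.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into the longest vocabulary pieces, left to right.

    Pieces after the first carry the continuation prefix. If any position
    cannot be matched, the whole word maps to the unknown token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


if __name__ == "__main__":
    # Hypothetical toy vocabulary, not taken from the real tokenizer.
    toy_vocab = {"తెలు", "##గు", "[UNK]"}
    print(wordpiece_tokenize("తెలుగు", toy_vocab))
```

Because matching works on code points, a vocabulary trained on Telugu text can keep consonants and their vowel signs together in a single piece, which helps both compression and interpretability.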