---
title: Telugu Tokenizer Demo
emoji: 🔤
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
---

# Telugu Tokenizer Demo

This Space demonstrates a custom tokenizer trained on a large corpus of Telugu text. It is designed to encode Telugu efficiently, representing text in few tokens (a high compression ratio).

## Features

- **Vocabulary Size**: 50,000+ tokens
- **Compression Ratio**: >3.0
- **Special Token Handling**: Includes [UNK], [CLS], [SEP], [PAD], [MASK]
- **Telugu-specific**: Optimized for the Telugu Unicode block (`U+0C00` to `U+0C7F`)
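
To make the Unicode range above concrete, here is a minimal sketch of checking whether characters fall inside the Telugu block. The helper names `is_telugu_char` and `telugu_fraction` are hypothetical and are not part of the demo's code; this only illustrates the range, not how the tokenizer itself uses it.

```python
# Bounds of the Telugu Unicode block, as listed in the features above.
TELUGU_START = 0x0C00
TELUGU_END = 0x0C7F


def is_telugu_char(ch: str) -> bool:
    """Return True if a single character lies in the Telugu block."""
    return TELUGU_START <= ord(ch) <= TELUGU_END


def telugu_fraction(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the Telugu block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_telugu_char(ch) for ch in chars) / len(chars)


if __name__ == "__main__":
    # Mixed Telugu and Latin input; the result depends on the text given.
    print(telugu_fraction("తెలుగు text"))
```

A check like this could be used to warn when the input is mostly non-Telugu, where the stated compression ratio may not apply.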

## Usage

1. Enter Telugu text in the input box
2. Click "Submit"
3. View the tokenization results:
   - Tokens
   - Token IDs
   - Number of tokens
   - Text length
   - Compression ratio
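
The reported statistics can be sketched as follows. This assumes the compression ratio is the text length in characters divided by the number of tokens; the README does not define the ratio, so treat that formula, and the names `TokenStats` and `token_stats`, as assumptions rather than the demo's actual code.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    num_tokens: int
    text_length: int
    compression_ratio: float


def token_stats(text: str, tokens: list[str]) -> TokenStats:
    """Summarize a tokenization (assumed ratio: characters per token)."""
    text_length = len(text)  # counts Unicode code points, not bytes
    num_tokens = len(tokens)
    # Guard against empty input to avoid division by zero.
    ratio = text_length / num_tokens if num_tokens else 0.0
    return TokenStats(num_tokens, text_length, ratio)


if __name__ == "__main__":
    text = "నమస్కారం ప్రపంచం"
    # Whitespace splitting stands in for the real tokenizer here.
    print(token_stats(text, text.split()))
```

Note that `len` counts code points, so a Telugu syllable written with a consonant and a vowel sign counts as two characters.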

## Examples

The demo includes several example texts showcasing different aspects of Telugu text:
- Basic greetings
- Simple sentences
- Complex sentences
- Long paragraphs

## Tokenizer Source

The tokenizer is available at: [https://huggingface.co/Saiteja/telugu-bpe](https://huggingface.co/Saiteja/telugu-bpe)

## Technical Details

- Built using the 🤗 Tokenizers library
- Uses WordPiece tokenization with Telugu-specific pre-tokenization rules
- Handles Telugu Unicode characters effectively
- Maintains high compression ratio while preserving token interpretability
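
For readers unfamiliar with WordPiece, the following is a minimal pure-Python sketch of its greedy longest-match-first splitting of a single word. The toy vocabulary is hypothetical and is not the trained vocabulary, and the `##` continuation prefix is the common WordPiece convention, assumed here rather than confirmed for this tokenizer. The real tokenizer relies on the 🤗 Tokenizers library, not this function.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a hit.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


if __name__ == "__main__":
    # Hypothetical toy vocabulary, for illustration only.
    toy_vocab = {"తె", "##లు", "##గు"}
    print(wordpiece_tokenize("తెలుగు", toy_vocab))  # ['తె', '##లు', '##గు']
```

A word with no matching pieces maps to `[UNK]`, which is one reason a large, Telugu-specific vocabulary helps keep tokens interpretable.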