metadata
title: Telugu Tokenizer Demo
emoji: 🔤
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false

Telugu Tokenizer Demo

This demo showcases a custom tokenizer trained on a large corpus of Telugu text. It is designed to tokenize Telugu efficiently while keeping a high compression ratio (above 3.0).

Features

  • Vocabulary Size: 50,000+ tokens
  • Compression Ratio: >3.0
  • Special Token Handling: Includes [UNK], [CLS], [SEP], [PAD], [MASK]
  • Telugu-specific: Optimized for Telugu character set (Unicode range: \u0C00-\u0C7F)
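The Unicode range listed above can be checked directly in code. The following is a minimal sketch, not code taken from the demo; the helper names `is_telugu_char` and `telugu_fraction` are hypothetical and only illustrate how input text might be tested against the Telugu block.

```python
TELUGU_START = 0x0C00  # first code point of the Telugu Unicode block
TELUGU_END = 0x0C7F    # last code point of the Telugu Unicode block


def is_telugu_char(ch: str) -> bool:
    """Return True if the single character lies in the Telugu block."""
    return TELUGU_START <= ord(ch) <= TELUGU_END


def telugu_fraction(text: str) -> float:
    """Fraction of non-whitespace characters that are Telugu (0.0 if none)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_telugu_char(ch) for ch in chars) / len(chars)
```

A check like this could be used to warn when the input is mostly non-Telugu text, for which the compression figures quoted here may not apply.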

Usage

  1. Enter Telugu text in the input box
  2. Click "Submit"
  3. View the tokenization results:
    • Tokens
    • Token IDs
    • Number of tokens
    • Text length
    • Compression ratio
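The relationship between the reported values can be sketched as follows. This README does not define the compression ratio, so the sketch assumes it is the number of characters in the input divided by the number of tokens. The function name `tokenization_stats` and the example token split are hypothetical, not output from the actual tokenizer.

```python
from typing import Dict, List, Union


def tokenization_stats(text: str, tokens: List[str]) -> Dict[str, Union[int, float]]:
    """Summarize a tokenization result.

    Assumption: compression ratio = characters in the text / number of tokens.
    """
    num_tokens = len(tokens)
    text_length = len(text)  # counts Unicode code points, not bytes
    ratio = text_length / num_tokens if num_tokens else 0.0
    return {
        "num_tokens": num_tokens,
        "text_length": text_length,
        "compression_ratio": ratio,
    }


# Hypothetical token split, for illustration only (not real tokenizer output).
stats = tokenization_stats("నమస్కారం", ["నమ", "##స్కారం"])
print(stats)
```

Under this assumed definition, a ratio above 3.0 would mean that each token covers more than three code points on average.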

Examples

The demo includes several example texts showcasing different aspects of Telugu text:

  • Basic greetings
  • Simple sentences
  • Complex sentences
  • Long paragraphs

Tokenizer Source

The tokenizer is available at: https://huggingface.co/Saiteja/telugu-bpe

Technical Details

  • Built using the 🤗 Tokenizers library
  • Uses WordPiece tokenization with Telugu-specific pre-tokenization rules
  • Handles characters in the Telugu Unicode block (U+0C00 to U+0C7F)
  • Maintains high compression ratio while preserving token interpretability
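The pre-tokenization rules are not spelled out in this README. As one possible illustration, and not the demo's actual implementation, the sketch below splits text into runs of Telugu characters and runs of other non-space characters, dropping whitespace. It uses only Python's standard `re` module; the function name `pre_tokenize` and the splitting rule itself are assumptions.

```python
import re
from typing import List

# Either a run of Telugu-block characters, or a run of other
# characters that are neither whitespace nor Telugu.
_PIECE = re.compile(r"[\u0C00-\u0C7F]+|[^\s\u0C00-\u0C7F]+")


def pre_tokenize(text: str) -> List[str]:
    """Split text into Telugu runs and non-Telugu runs, dropping whitespace."""
    return _PIECE.findall(text)
```

Matching the whole Telugu block keeps vowel signs and the virama attached to the consonants they modify, so a subword model applied afterwards sees each Telugu word as a single piece.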