
	---
	title: Telugu Tokenizer Demo
	emoji: 🔤
	colorFrom: indigo
	colorTo: purple
	sdk: gradio
	sdk_version: 5.9.1
	app_file: app.py
	pinned: false
	---

	# Telugu Tokenizer Demo

	This Space demonstrates a custom Telugu tokenizer trained on a large corpus of Telugu text. The tokenizer is designed to encode Telugu text into as few tokens as possible, that is, to achieve a high compression ratio.

	## Features

	- Vocabulary Size: 50,000+ tokens
	- Compression Ratio: >3.0
	- Special Token Handling: Includes [UNK], [CLS], [SEP], [PAD], [MASK]
	- Telugu-specific: Optimized for the Telugu Unicode block (U+0C00–U+0C7F)
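
To make the Unicode range above concrete, here is a minimal sketch that checks whether characters fall inside the Telugu block. The helper names `is_telugu_char` and `telugu_fraction` are hypothetical and are not part of this Space or the tokenizer; the sketch only illustrates the range, not how the tokenizer uses it.

```python
# Telugu Unicode block boundaries, as listed in the feature list.
TELUGU_START = 0x0C00
TELUGU_END = 0x0C7F


def is_telugu_char(ch: str) -> bool:
    """Return True if ch is a single code point inside the Telugu block."""
    return len(ch) == 1 and TELUGU_START <= ord(ch) <= TELUGU_END


def telugu_fraction(text: str) -> float:
    """Fraction of non-whitespace code points that fall in the Telugu block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_telugu_char(c) for c in chars) / len(chars)


if __name__ == "__main__":
    # "\u0C24" is the Telugu letter TA; "a" is outside the block.
    print(is_telugu_char("\u0C24"), is_telugu_char("a"))
    print(telugu_fraction("\u0C24 a"))
```

A check like this can be useful for filtering mixed-script input before tokenization, since text outside the block is less likely to be well covered by a Telugu-focused vocabulary.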

	## Usage

	1. Enter Telugu text in the input box
	2. Click "Submit"
	3. View the tokenization results:
	   - Tokens
	   - Token IDs
	   - Number of tokens
	   - Text length
	   - Compression ratio
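
The statistics listed above can be reproduced outside the demo with a few lines of Python. This is a minimal sketch that assumes text length is counted in Unicode code points and that the compression ratio is text length divided by the number of tokens; the Space's `app.py` may define either quantity differently. `compression_stats` is a hypothetical helper, not part of the demo.

```python
def compression_stats(text: str, tokens: list) -> dict:
    """Summarize a tokenization result.

    Assumption: compression ratio = characters (code points) per token.
    """
    num_tokens = len(tokens)
    text_length = len(text)
    # Avoid division by zero for empty input.
    ratio = text_length / num_tokens if num_tokens else 0.0
    return {
        "num_tokens": num_tokens,
        "text_length": text_length,
        "compression_ratio": ratio,
    }


if __name__ == "__main__":
    # Placeholder tokens for illustration; real tokens come from the tokenizer.
    stats = compression_stats("abcdef", ["abc", "def"])
    print(stats)
```

Under this definition, a ratio above 3.0 means each token covers more than three code points on average.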

	## Examples

	The demo includes several example texts showcasing different aspects of Telugu text:
	- Basic greetings
	- Simple sentences
	- Complex sentences
	- Long paragraphs

	## Tokenizer Source

	The tokenizer is available at: [https://huggingface.co/Saiteja/telugu-bpe](https://huggingface.co/Saiteja/telugu-bpe)

	## Technical Details

	- Built using the 🤗 Tokenizers library
	- Uses WordPiece tokenization with Telugu-specific pre-tokenization rules
	- Handles Telugu Unicode characters effectively
	- Maintains high compression ratio while preserving token interpretability
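
For readers unfamiliar with WordPiece, the sketch below shows the generic greedy longest-match encoding step that WordPiece tokenizers apply to a single pre-tokenized word. It is an illustration only, not this tokenizer's implementation: the tiny vocabulary is made up, the `##` continuation prefix is the common BERT convention and is assumed here, and `wordpiece_encode` is a hypothetical helper.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece encoding of one word.

    Pieces after the first carry the continuation prefix. If any position
    has no matching piece, the whole word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


if __name__ == "__main__":
    # Made-up vocabulary: two pieces that together spell a six-code-point
    # Telugu word (written with escapes to keep the example unambiguous).
    vocab = {"\u0C24\u0C46\u0C32\u0C41", "##\u0C17\u0C41", "[UNK]"}
    print(wordpiece_encode("\u0C24\u0C46\u0C32\u0C41\u0C17\u0C41", vocab))
    print(wordpiece_encode("xyz", vocab))
```

The first call splits the word into a leading piece and one `##`-prefixed continuation piece; the second has no matching pieces and falls back to `[UNK]`.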