Hindi Byte Pair Encoding (BPE) Tokenizer
A specialized BPE tokenizer for Hindi text that achieves efficient compression while maintaining linguistic coherence.
Online Demo
Try the tokenizer in your browser: Hindi BPE Tokenizer Demo
Project Overview
This project implements a Byte Pair Encoding (BPE) tokenizer specifically designed for Hindi text. It features:
- Efficient trie-based tokenization
- Visualization of training progress
- Compression ratio optimization
- Support for large Hindi text datasets
- Hugging Face compatibility
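The snippet below sketches how the tokenizer might be used end to end. The class and method names (`HindiBPE`, `train`, `encode`, `decode`, `save`) and the file paths are illustrative assumptions, not the exact API exposed by `hindi_bpe.py`.

```python
# Illustrative usage sketch; the actual API in hindi_bpe.py may differ.
from hindi_bpe import HindiBPE  # hypothetical class name

# Train a tokenizer on a Hindi corpus (vocabulary target and path are examples).
tokenizer = HindiBPE(vocab_size=5000)
tokenizer.train("data/train/hindi_corpus.txt")

# Encode and decode a sample sentence; a correct BPE round trip restores the input.
text = "नमस्ते दुनिया"
token_ids = tokenizer.encode(text)
assert tokenizer.decode(token_ids) == text

# Persist the trained tokenizer for later use.
tokenizer.save("tokenizer/")
```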
Project Structure
```
hindi-bpe/
├── data/                  # Dataset directory
│   ├── train/             # Training data
│   └── valid/             # Validation data
├── tokenizer/             # Saved tokenizer files
│   ├── encoder.json       # Encoder state
│   └── vocab_stats.json   # Vocabulary statistics
├── output/                # Visualization outputs
├── byte_pair_encoder.py   # Core BPE implementation
├── hindi_bpe.py           # Hindi-specific wrapper
├── test_hindi_bpe.py      # Test suite
└── requirements.txt       # Dependencies
```
Training Statistics
- Iteration 4500:
  - Vocabulary size: 4,477
  - Data size: 448,754
  - Compression ratio: 3.66
  - Max token length: 64
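If the compression ratio is defined as the original character count divided by the number of encoded tokens (a common convention, assumed here), the figures above imply roughly 448,754 / 3.66 ≈ 122,600 tokens after encoding.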
File Descriptions
byte_pair_encoder.py
- Core BPE implementation
- Trie-based tokenization
- Training statistics tracking
- Visualization utilities
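To illustrate what trie-based tokenization means here, the sketch below builds a character trie over a learned vocabulary and greedily matches the longest known token at each position. It is a simplified stand-in for the actual implementation in `byte_pair_encoder.py`, not the project's code.

```python
# Simplified sketch of trie-based longest-match tokenization.

def build_trie(vocab):
    """Build a nested-dict trie from an iterable of token strings."""
    trie = {}
    for token in vocab:
        node = trie
        for ch in token:
            node = node.setdefault(ch, {})
        node["_end"] = token  # mark a complete token at this node
    return trie

def tokenize(text, trie):
    """Greedily emit the longest vocabulary token starting at each position."""
    tokens, i = [], 0
    while i < len(text):
        node, longest, j = trie, None, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "_end" in node:
                longest = node["_end"]
        if longest is None:          # fall back to a single character
            tokens.append(text[i])
            i += 1
        else:
            tokens.append(longest)
            i += len(longest)
    return tokens

# Example with a toy vocabulary of Hindi fragments.
trie = build_trie(["नम", "नमस्ते", "स्ते", " "])
print(tokenize("नमस्ते", trie))  # ['नमस्ते']
```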
hindi_bpe.py
- Hindi-specific tokenizer wrapper
- Text preprocessing
- Model saving/loading
- Compression ratio calculation
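The following is a minimal sketch of what the preprocessing and compression-ratio steps could look like; the actual rules in `hindi_bpe.py` (which characters are kept, whether the ratio is measured in characters or bytes) may differ.

```python
import re

# Hypothetical sketch; hindi_bpe.py's actual preprocessing and metric may differ.

def preprocess(text):
    """Keep Devanagari characters (U+0900-U+097F), whitespace, digits, and basic
    punctuation; collapse runs of whitespace."""
    text = re.sub(r"[^\u0900-\u097F\s0-9.,!?]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def compression_ratio(text, token_ids):
    """Original character count divided by the number of tokens produced."""
    return len(text) / max(len(token_ids), 1)
```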
app.py
- Interactive web interface
- Real-time tokenization
- Training visualization
- Model parameter tuning
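As a rough illustration of the kind of interface `app.py` provides, the sketch below wires a text box and a vocabulary-size slider to the (hypothetical) tokenizer API used earlier; the real app's widgets and layout may differ.

```python
# Minimal Streamlit sketch; not the project's actual app.py.
import streamlit as st
from hindi_bpe import HindiBPE  # hypothetical import, as above

st.title("Hindi BPE Tokenizer")
vocab_size = st.slider("Vocabulary size", 1000, 5000, 4500)
text = st.text_area("Enter Hindi text")

if text:
    tokenizer = HindiBPE(vocab_size=vocab_size)  # in practice, load a trained model
    tokens = tokenizer.encode(text)
    st.write(f"{len(tokens)} tokens")
    st.write(tokens)
```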
test_hindi_bpe.py
- Test suite for tokenizer
- Performance benchmarks
- Example usage
Installation
```bash
# Clone the repository
git clone https://github.com/yourusername/hindi-bpe.git
cd hindi-bpe

# Install dependencies
pip install -r requirements.txt
```
Download and prepare the dataset:
```bash
python download_dataset.py
```
Web Interface
```bash
streamlit run app.py
```
Testing
```bash
python test_hindi_bpe.py
```
The test suite includes (see the sketch after this list):
- Training pipeline verification
- Compression ratio validation
- Token count requirements
- Encoding/decoding accuracy
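A sketch of what an encoding/decoding accuracy check might look like, reusing the hypothetical `HindiBPE` API from the usage example above; the real test suite may structure this differently.

```python
# Illustrative round-trip test; paths and names are assumptions.
from hindi_bpe import HindiBPE

def test_round_trip():
    tokenizer = HindiBPE(vocab_size=5000)
    tokenizer.train("data/train/hindi_corpus.txt")
    sample = "भारत एक विशाल देश है।"
    assert tokenizer.decode(tokenizer.encode(sample)) == sample
```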
Performance Metrics
The tokenizer aims to achieve:
- Vocabulary size < 5000 tokens
- Compression ratio ≥ 3.2
- Fast encoding/decoding
- Memory-efficient operation
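A quick way to sanity-check encoding/decoding speed is a simple timing loop like the one below (again assuming the hypothetical API above); actual numbers depend on hardware and corpus size.

```python
# Rough timing sketch for encode/decode throughput.
import time

def benchmark(tokenizer, text, runs=100):
    start = time.perf_counter()
    for _ in range(runs):
        ids = tokenizer.encode(text)
        tokenizer.decode(ids)
    elapsed = time.perf_counter() - start
    return elapsed / runs  # average seconds per encode/decode round trip
```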
Contributing
- Fork the repository
- Create feature branch
- Commit changes
- Push to branch
- Create Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.