---
library_name: transformers
license: mit
datasets:
- c3rl/IIIT-INDIC-HW-WORDS-Hindi
language:
- ne
metrics:
- cer
base_model:
- google/vit-base-patch16-224-in21k
- amitness/roberta-base-ne
pipeline_tag: image-to-text
---

# TrOCR Devanagari - Handwritten Text Recognition

## Overview

TrOCR Devanagari is an end-to-end Vision Encoder-Decoder model that recognizes handwritten Devanagari script (specifically Nepali) and converts it into machine-readable text. It uses a Vision Transformer (ViT) as the encoder and a transformer-based decoder (NepBERT) to produce the textual output. The project aims to assist in digitizing handwritten Nepali documents.

## Model Architecture

The model pipeline includes the following steps:

1. **Text Detection:** Extracts regions of interest from scanned handwritten documents (a minimal sketch is shown under Getting Started below).
2. **Image Preprocessing:** Resizes and pads input images before feeding them into the model.
3. **Text Recognition:** Uses the TrOCR-based Vision Encoder-Decoder model to predict the handwritten text.
4. **UI Interface (Optional):** Displays the results and enables user interaction with the system.

## Model Information

- **Model Name:** TrOCR Devanagari
- **Developed by:** Anil Paudel, Aayush Puri, Yubaraj Sigdel
- **Language:** Nepali
- **License:** MIT (tentative)
- **Model Type:** Vision Encoder Decoder
- **Repository:** [paudelanil/trocr-devanagari-2](https://huggingface.co./paudelanil/trocr-devanagari-2)
- **Training Data:** IIIT-HW Dataset
- **Evaluation Metric:** CER (Character Error Rate)
- **Hardware Used:** NVIDIA RTX A4500

## Getting Started

### Installation

To use the model, ensure you have the following Python packages installed:

```bash
pip install torch transformers pillow
```
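### Text Detection (Optional)

The first pipeline step, text detection, is not included in this repository. The snippet below is only a minimal sketch of how word-level regions could be cropped from a scanned page using OpenCV contour detection; it assumes `opencv-python` is installed, and the threshold, kernel size, and minimum area are illustrative values that will need tuning for your documents.

```python
# Minimal, illustrative word-region detector (not part of the released model).
import cv2


def detect_word_regions(image_path, min_area=100):
    """Return bounding boxes (x, y, w, h) of likely word regions, roughly in reading order."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Otsu threshold: ink becomes white on a black background
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Dilate horizontally so the characters of a word merge into one blob
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 5))
    dilated = cv2.dilate(binary, kernel, iterations=1)

    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > min_area]

    # Sort top-to-bottom, then left-to-right (very rough reading order)
    boxes.sort(key=lambda b: (b[1], b[0]))
    return boxes
```

Each detected box can then be cropped (for example with `Image.open(page).crop((x, y, x + w, y + h))`) and passed through the preprocessing and prediction functions below.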
### Preprocessing Function

The preprocessing function resizes an input image to the target size while preserving its aspect ratio, then pads the remaining space with white pixels.

```python
from PIL import Image


def preprocess_image(image):
    # Resize to fit within 224x224 while preserving the aspect ratio
    target_size = (224, 224)
    original_size = image.size
    aspect_ratio = original_size[0] / original_size[1]
    if aspect_ratio > 1:
        new_width = target_size[0]
        new_height = int(target_size[0] / aspect_ratio)
    else:
        new_height = target_size[1]
        new_width = int(target_size[1] * aspect_ratio)
    resized_img = image.resize((new_width, new_height))

    # Center the resized image on a white 224x224 canvas
    padding_width = target_size[0] - new_width
    padding_height = target_size[1] - new_height
    pad_left = padding_width // 2
    pad_top = padding_height // 2
    pad_image = Image.new('RGB', target_size, (255, 255, 255))
    pad_image.paste(resized_img, (pad_left, pad_top))
    return pad_image
```

### Prediction Code

Here's how you can use the model for text recognition:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTFeatureExtractor, TrOCRProcessor

# Load the model and processor
tokenizer = AutoTokenizer.from_pretrained("aayushpuri01/TrOCR-Devanagari")
model1 = VisionEncoderDecoderModel.from_pretrained("aayushpuri01/TrOCR-Devanagari")
feature_extractor1 = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
processor1 = TrOCRProcessor(feature_extractor=feature_extractor1, tokenizer=tokenizer)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model1.to(device)


# Prediction function
def predict(image_path):
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")
    image = preprocess_image(image)
    pixel_values = processor1(image, return_tensors="pt").pixel_values.to(device)

    # Generate text from the image
    generated_ids = model1.generate(pixel_values)
    generated_text = processor1.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text
```

### Usage Example

```python
# Load and predict
image_path = "path_to_your_image.jpg"
predicted_text = predict(image_path)
print("Predicted Text:", predicted_text)
```

## Training Hyperparameters

The model was fine-tuned with the following `Seq2SeqTrainingArguments`; CER is used to select the best checkpoint (a sketch of the CER computation is given at the end of this card).

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    output_dir='/workspace/checkpoint-save/',
    save_total_limit=2,
    logging_steps=2,
    save_steps=1000,
    eval_steps=1000,
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="cer",
    greater_is_better=False,
    num_train_epochs=15
)
```

## License

The model is shared under the MIT license. For details, see the [LICENSE](LICENSE) file.

## Acknowledgments

This model is built on the 🤗 Transformers library and uses a ViT encoder with a NepBERT decoder. Special thanks to the IIIT-HW dataset contributors.

---

Feel free to explore the project and contribute to the repository!
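## Appendix: Computing CER

The training arguments above select the best checkpoint by CER, but the metric function itself is not shown in this card. The following is only a minimal sketch of how CER can be computed with the `evaluate` library (which relies on `jiwer`) inside a `compute_metrics` callback for `Seq2SeqTrainer`; it assumes `tokenizer` is the one loaded in the prediction code, and the exact function used during training may differ.

```python
# Minimal sketch of CER metric wiring for Seq2SeqTrainer.
# Assumes: pip install evaluate jiwer, and `tokenizer` loaded as in the prediction code above.
import evaluate

cer_metric = evaluate.load("cer")


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace the -100 padding used for the loss with the real pad token before decoding
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    cer = cer_metric.compute(predictions=pred_str, references=label_str)
    return {"cer": cer}
```

Passing `compute_metrics=compute_metrics` to `Seq2SeqTrainer` is what makes the `metric_for_best_model="cer"` setting above take effect.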