---
library_name: transformers
license: mit
datasets:
- c3rl/IIIT-INDIC-HW-WORDS-Hindi
language:
- ne
metrics:
- cer
base_model:
- google/vit-base-patch16-224-in21k
- amitness/roberta-base-ne
pipeline_tag: image-to-text
---
# TrOCR Devanagari - Handwritten Text Recognition
## Overview
TrOCR Devanagari is an end-to-end Vision Encoder-Decoder model that recognizes handwritten Devanagari script (specifically for the Nepali language) and converts it into machine-readable text. It pairs a Vision Transformer (ViT) encoder with a transformer-based decoder (NepBERT) to produce textual output. This project aims to assist in digitizing handwritten Nepali documents.
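As a rough illustration of how this encoder-decoder pairing is typically assembled in 🤗 Transformers (this is a sketch, not the exact training script; the released checkpoint already contains the trained weights and should be loaded directly as shown under "Prediction Code"):

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Pair the listed base models: a ViT encoder and a Nepali RoBERTa (NepBERT-style) decoder.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision encoder
    "amitness/roberta-base-ne",           # Nepali RoBERTa decoder
)
tokenizer = AutoTokenizer.from_pretrained("amitness/roberta-base-ne")

# Decoder-side special tokens that generation relies on.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size
```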
## Model Architecture
The model pipeline includes the following steps:
1. **Text Detection:** Extracts regions of interest from scanned handwritten documents.
2. **Image Preprocessing:** Resizes and pads input images to feed into the model.
3. **Text Recognition:** Uses the TrOCR-based Vision Encoder Decoder model to predict handwritten text.
4. **UI Interface (Optional):** Displays the results and enables user interaction with the system.
## Model Information
- **Model Name:** TrOCR Devanagari
- **Developed by:** Anil Paudel, Aayush Puri, Yubaraj Sigdel
- **Language:** Nepali
- **License:** MIT (tentative)
- **Model Type:** Vision Encoder Decoder
- **Repository:** [paudelanil/trocr-devanagari-2](https://huggingface.co./paudelanil/trocr-devanagari-2)
- **Training Data:** IIIT-INDIC-HW-WORDS dataset (`c3rl/IIIT-INDIC-HW-WORDS-Hindi`)
- **Evaluation Metric:** CER (Character Error Rate); a sample computation is sketched after this list
- **Hardware Used:** NVIDIA RTX A4500
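
CER is the character-level edit distance between predictions and references, normalized by the number of reference characters; lower is better. As a minimal example (not the project's evaluation script), it can be computed with the 🤗 `evaluate` library, which requires `pip install evaluate jiwer`:

```python
import evaluate

# CER = (substitutions + insertions + deletions) / reference characters
cer_metric = evaluate.load("cer")

predictions = ["नमस्ते संसार"]    # example model outputs, not from the actual eval set
references = ["नमस्ते संसार!"]    # ground-truth transcriptions

score = cer_metric.compute(predictions=predictions, references=references)
print(f"CER: {score:.4f}")
```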
## Getting Started
### Installation
To use the model, ensure you have the following Python packages installed:
```bash
pip install torch transformers pillow
```
### Preprocessing Function
The preprocessing function resizes an input image to the 224×224 target size while preserving its aspect ratio, then pads the remaining space with white.
```python
from PIL import Image

def preprocess_image(image):
    """Resize to 224x224 while keeping the aspect ratio, padding the rest with white."""
    target_size = (224, 224)
    original_size = image.size
    aspect_ratio = original_size[0] / original_size[1]

    # Scale the longer side down to the target dimension.
    if aspect_ratio > 1:
        new_width = target_size[0]
        new_height = int(target_size[0] / aspect_ratio)
    else:
        new_height = target_size[1]
        new_width = int(target_size[1] * aspect_ratio)
    resized_img = image.resize((new_width, new_height))

    # Center the resized image on a white canvas.
    padding_width = target_size[0] - new_width
    padding_height = target_size[1] - new_height
    pad_left = padding_width // 2
    pad_top = padding_height // 2
    pad_image = Image.new('RGB', target_size, (255, 255, 255))
    pad_image.paste(resized_img, (pad_left, pad_top))
    return pad_image
```
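As a quick check (the file name below is a placeholder), the function should always return a white-padded 224×224 RGB image:

```python
from PIL import Image

img = Image.open("sample_word.jpg").convert("RGB")  # placeholder path
padded = preprocess_image(img)
print(padded.size)  # -> (224, 224)
```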
### Prediction Code
Here’s how you can use the model for text recognition:
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTFeatureExtractor, TrOCRProcessor
# Load the model and processor
tokenizer = AutoTokenizer.from_pretrained("aayushpuri01/TrOCR-Devanagari")
model1 = VisionEncoderDecoderModel.from_pretrained("aayushpuri01/TrOCR-Devanagari")
feature_extractor1 = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
processor1 = TrOCRProcessor(feature_extractor=feature_extractor1, tokenizer=tokenizer)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model1.to(device)
# Prediction function
def predict(image):
    # Preprocess the image
    image = Image.open(image).convert("RGB")
    image = preprocess_image(image)
    pixel_values = processor1(image, return_tensors="pt").pixel_values.to(device)

    # Generate text from the image
    generated_ids = model1.generate(pixel_values)
    generated_text = processor1.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text
```
### Usage Example
```python
# Load and predict
image_path = "path_to_your_image.jpg"
predicted_text = predict(image_path)
print("Predicted Text:", predicted_text)
```
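To transcribe a folder of word crops in one pass, a simple batched variant of the same pipeline could look like the sketch below. It reuses `preprocess_image`, `processor1`, `model1`, and `device` from the code above; the directory path and batch size are illustrative, not part of the original project:

```python
import os
import torch
from PIL import Image

def predict_batch(image_dir, batch_size=16):
    """Run recognition over every image in a directory and return a path -> text mapping."""
    paths = sorted(
        os.path.join(image_dir, f)
        for f in os.listdir(image_dir)
        if f.lower().endswith((".jpg", ".jpeg", ".png"))
    )
    results = {}
    for start in range(0, len(paths), batch_size):
        batch_paths = paths[start:start + batch_size]
        images = [preprocess_image(Image.open(p).convert("RGB")) for p in batch_paths]
        pixel_values = processor1(images, return_tensors="pt").pixel_values.to(device)
        with torch.no_grad():
            generated_ids = model1.generate(pixel_values)
        texts = processor1.batch_decode(generated_ids, skip_special_tokens=True)
        results.update(dict(zip(batch_paths, texts)))
    return results
```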
## Training Hyperparameters
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    output_dir='/workspace/checkpoint-save/',
    save_total_limit=2,
    logging_steps=2,
    save_steps=1000,
    eval_steps=1000,
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="cer",
    greater_is_better=False,
    num_train_epochs=15,
)
```
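These arguments would typically be passed to a `Seq2SeqTrainer` together with the model, tokenizer, and datasets. The sketch below shows that wiring under stated assumptions: `model`, `tokenizer`, `train_dataset`, and `eval_dataset` are placeholders from a training setup, not objects defined in this card, and the exact original training script may differ:

```python
import evaluate
from transformers import Seq2SeqTrainer, default_data_collator

cer_metric = evaluate.load("cer")

def compute_metrics(pred):
    # Decode predicted and reference token ids back to strings, then score CER.
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return {"cer": cer_metric.compute(predictions=pred_str, references=label_str)}

trainer = Seq2SeqTrainer(
    model=model,                    # VisionEncoderDecoderModel being fine-tuned (placeholder)
    args=training_args,
    train_dataset=train_dataset,    # placeholder: handwritten-word dataset
    eval_dataset=eval_dataset,      # placeholder
    compute_metrics=compute_metrics,
    data_collator=default_data_collator,
)
trainer.train()
```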
## License
The model is shared under the MIT license. For details, see the [LICENSE](LICENSE) file.
## Acknowledgments
This model is built with the 🤗 Transformers library and combines a ViT encoder with a NepBERT decoder. Special thanks to the IIIT-HW dataset contributors.
---
Feel free to explore the project and contribute to the repository!