---
library_name: transformers
license: mit
datasets:
- c3rl/IIIT-INDIC-HW-WORDS-Hindi
language:
- ne
metrics:
- cer
base_model:
- google/vit-base-patch16-224-in21k
- amitness/roberta-base-ne
pipeline_tag: image-to-text
---
# TrOCR Devanagari - Handwritten Text Recognition

## Overview

TrOCR Devanagari is an end-to-end Vision Encoder-Decoder model built to recognize handwritten Devanagari script (specifically the Nepali language) and convert it into machine-readable text. It pairs a Vision Transformer (ViT) encoder with a transformer-based decoder (NepBERT) to produce the textual output. This project aims to assist in digitizing handwritten Nepali documents.

## Model Architecture

The model pipeline includes the following steps:

1. **Text Detection:** Extracts regions of interest from scanned handwritten documents.
2. **Image Preprocessing:** Resizes and pads input images before they are fed to the model.
3. **Text Recognition:** Uses the TrOCR-style Vision Encoder-Decoder model to predict the handwritten text (see the sketch after this list).
4. **UI Interface (Optional):** Displays the results and enables user interaction with the system.
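
The recognition step follows the standard VisionEncoderDecoder pattern from 🤗 Transformers. The snippet below is a minimal, illustrative sketch of how such a model can be assembled from the encoder and decoder checkpoints listed in this card's metadata (`google/vit-base-patch16-224-in21k` and `amitness/roberta-base-ne`); the exact training configuration may have differed.

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Assumed checkpoints (taken from the card metadata); the released model on the
# Hub already contains the combined weights, so this is only a sketch.
tokenizer = AutoTokenizer.from_pretrained("amitness/roberta-base-ne")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # ViT image encoder
    "amitness/roberta-base-ne",           # NepBERT text decoder
)

# Generation needs to know which tokens start, pad, and end a sequence
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
```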

## Model Information

- **Model Name:** TrOCR Devanagari
- **Developed by:** Anil Paudel, Aayush Puri, Yubaraj Sigdel
- **Language:** Nepali
- **License:** MIT (tentative)
- **Model Type:** Vision Encoder Decoder
- **Repository:** [paudelanil/trocr-devanagari-2](https://huggingface.co./paudelanil/trocr-devanagari-2)
- **Training Data:** IIIT-HW Dataset
- **Evaluation Metric:** CER (Character Error Rate); see the example after this list
- **Hardware Used:** NVIDIA RTX A4500
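
CER is the character-level edit distance between a prediction and its reference transcription, normalized by the number of reference characters. The snippet below shows one way to compute it with the 🤗 `evaluate` library; the choice of library is an assumption, since the card does not state which tooling was used for evaluation.

```python
import evaluate  # pip install evaluate jiwer

cer_metric = evaluate.load("cer")

predictions = ["नेपाल"]   # model outputs
references = ["नेपाली"]   # ground-truth transcriptions

# CER = (substitutions + deletions + insertions) / reference characters
print(cer_metric.compute(predictions=predictions, references=references))
```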

## Getting Started

### Installation

To use the model, ensure you have the following Python packages installed:

```bash
pip install torch transformers pillow
```

### Preprocessing Function

The image preprocessing function resizes an input image to the target size while preserving its aspect ratio, then pads the remaining space with white.

```python
from PIL import Image


def preprocess_image(image):
    """Resize an image to 224x224, preserving aspect ratio and padding with white."""
    target_size = (224, 224)
    original_size = image.size  # (width, height)

    # Scale the longer side to 224 while keeping the aspect ratio
    aspect_ratio = original_size[0] / original_size[1]
    if aspect_ratio > 1:
        new_width = target_size[0]
        new_height = int(target_size[0] / aspect_ratio)
    else:
        new_height = target_size[1]
        new_width = int(target_size[1] * aspect_ratio)

    resized_img = image.resize((new_width, new_height))

    # Center the resized image on a white 224x224 canvas
    padding_width = target_size[0] - new_width
    padding_height = target_size[1] - new_height
    pad_left = padding_width // 2
    pad_top = padding_height // 2

    pad_image = Image.new('RGB', target_size, (255, 255, 255))
    pad_image.paste(resized_img, (pad_left, pad_top))
    return pad_image
```
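
For example (the file name below is only a placeholder):

```python
img = Image.open("sample_word.jpg").convert("RGB")  # placeholder path
padded = preprocess_image(img)
print(padded.size)  # (224, 224)
```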

### Prediction Code

Here’s how you can use the model for text recognition:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTFeatureExtractor, TrOCRProcessor

# Load the model and processor
tokenizer = AutoTokenizer.from_pretrained("aayushpuri01/TrOCR-Devanagari")
model1 = VisionEncoderDecoderModel.from_pretrained("aayushpuri01/TrOCR-Devanagari")
feature_extractor1 = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
processor1 = TrOCRProcessor(feature_extractor=feature_extractor1, tokenizer=tokenizer)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model1.to(device)


# Prediction function
def predict(image):
    # Preprocess the image
    image = Image.open(image).convert("RGB")
    image = preprocess_image(image)
    pixel_values = processor1(image, return_tensors="pt").pixel_values.to(device)

    # Generate text from the image
    generated_ids = model1.generate(pixel_values)
    generated_text = processor1.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return generated_text
```

### Usage Example

```python
# Load and predict
image_path = "path_to_your_image.jpg"
predicted_text = predict(image_path)
print("Predicted Text:", predicted_text)
```

## Training Hyperparameters

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    output_dir='/workspace/checkpoint-save/',
    save_total_limit=2,
    logging_steps=2,
    save_steps=1000,
    eval_steps=1000,
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="cer",
    greater_is_better=False,
    num_train_epochs=15
)
```
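
These arguments are intended for a `Seq2SeqTrainer`. A minimal sketch of the wiring is shown below; `train_dataset`, `eval_dataset`, and `compute_cer` are placeholders, since the card does not include the full training script.

```python
from transformers import Seq2SeqTrainer, default_data_collator

trainer = Seq2SeqTrainer(
    model=model1,                         # the VisionEncoderDecoderModel
    tokenizer=processor1.feature_extractor,
    args=training_args,
    train_dataset=train_dataset,          # placeholder: dataset yielding pixel_values and labels
    eval_dataset=eval_dataset,            # placeholder
    compute_metrics=compute_cer,          # placeholder: returns {"cer": ...}
    data_collator=default_data_collator,
)
trainer.train()
```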

## License

The model is shared under the MIT license. For details, see the [LICENSE](LICENSE) file.

## Acknowledgments

This model is built with the 🤗 Transformers library and uses a ViT encoder with a NepBERT decoder. Special thanks to the IIIT-HW dataset contributors.

---

Feel free to explore the project and contribute to the repository!