---
library_name: transformers
license: mit
datasets:
- c3rl/IIIT-INDIC-HW-WORDS-Hindi
language:
- ne
metrics:
- cer
base_model:
- google/vit-base-patch16-224-in21k
- amitness/roberta-base-ne
pipeline_tag: image-to-text
---
# TrOCR Devanagari - Handwritten Text Recognition

## Overview

TrOCR Devanagari is an end-to-end Vision Encoder-Decoder model built to recognize handwritten Devanagari script (specifically the Nepali language) and convert it into machine-readable text. It pairs a Vision Transformer (ViT) encoder with a transformer-based decoder (NepBERT) to produce the textual output. This project aims to assist in digitizing handwritten Nepali documents.

## Model Architecture

The model pipeline includes the following steps:

1. **Text Detection:** Extracts regions of interest from scanned handwritten documents.
2. **Image Preprocessing:** Resizes and pads input images before they are fed to the model.
3. **Text Recognition:** Uses the TrOCR-style Vision Encoder-Decoder model to predict the handwritten text (see the sketch after this list).
4. **UI Interface (Optional):** Displays the results and enables user interaction with the system.
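
The recognition step follows the standard VisionEncoderDecoder pattern from 🤗 Transformers. The snippet below is a minimal, illustrative sketch of how such a model can be assembled from the encoder and decoder checkpoints listed in this card's metadata (`google/vit-base-patch16-224-in21k` and `amitness/roberta-base-ne`); the exact training configuration may have differed.

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Assumed checkpoints (taken from the card metadata); the released model on the
# Hub already contains the combined weights, so this is only a sketch.
tokenizer = AutoTokenizer.from_pretrained("amitness/roberta-base-ne")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # ViT image encoder
    "amitness/roberta-base-ne",           # NepBERT text decoder
)

# Generation needs to know which tokens start, pad, and end a sequence
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
```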

## Model Information

- **Model Name:** TrOCR Devanagari
- **Developed by:** Anil Paudel, Aayush Puri, Yubaraj Sigdel
- **Language:** Nepali
- **License:** MIT (tentative)
- **Model Type:** Vision Encoder Decoder
- **Repository:** [paudelanil/trocr-devanagari-2](https://huggingface.co./paudelanil/trocr-devanagari-2)
- **Training Data:** IIIT-HW Dataset
- **Evaluation Metric:** CER (Character Error Rate); see the example after this list
- **Hardware Used:** NVIDIA RTX A4500
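
CER is the character-level edit distance between a prediction and its reference transcription, normalized by the number of reference characters. The snippet below shows one way to compute it with the 🤗 `evaluate` library; the choice of library is an assumption, since the card does not state which tooling was used for evaluation.

```python
import evaluate  # pip install evaluate jiwer

cer_metric = evaluate.load("cer")

predictions = ["नेपाल"]   # model outputs
references = ["नेपाली"]   # ground-truth transcriptions

# CER = (substitutions + deletions + insertions) / reference characters
print(cer_metric.compute(predictions=predictions, references=references))
```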

## Getting Started

### Installation

To use the model, ensure you have the following Python packages installed:

```bash
pip install torch transformers pillow
```

### Preprocessing Function

The image preprocessing function resizes an input image to the target size while preserving its aspect ratio, then pads the remaining space with white.

```python
from PIL import Image


def preprocess_image(image):
    """Resize an image to 224x224, preserving aspect ratio and padding with white."""
    target_size = (224, 224)
    original_size = image.size  # (width, height)

    # Scale the longer side to 224 while keeping the aspect ratio
    aspect_ratio = original_size[0] / original_size[1]
    if aspect_ratio > 1:
        new_width = target_size[0]
        new_height = int(target_size[0] / aspect_ratio)
    else:
        new_height = target_size[1]
        new_width = int(target_size[1] * aspect_ratio)

    resized_img = image.resize((new_width, new_height))

    # Center the resized image on a white 224x224 canvas
    padding_width = target_size[0] - new_width
    padding_height = target_size[1] - new_height
    pad_left = padding_width // 2
    pad_top = padding_height // 2

    pad_image = Image.new('RGB', target_size, (255, 255, 255))
    pad_image.paste(resized_img, (pad_left, pad_top))
    return pad_image
```
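
For example (the file name below is only a placeholder):

```python
img = Image.open("sample_word.jpg").convert("RGB")  # placeholder path
padded = preprocess_image(img)
print(padded.size)  # (224, 224)
```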

### Prediction Code

Here’s how you can use the model for text recognition:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTFeatureExtractor, TrOCRProcessor

# Load the model and processor
tokenizer = AutoTokenizer.from_pretrained("aayushpuri01/TrOCR-Devanagari")
model1 = VisionEncoderDecoderModel.from_pretrained("aayushpuri01/TrOCR-Devanagari")
feature_extractor1 = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
processor1 = TrOCRProcessor(feature_extractor=feature_extractor1, tokenizer=tokenizer)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model1.to(device)


# Prediction function
def predict(image):
    # Preprocess the image
    image = Image.open(image).convert("RGB")
    image = preprocess_image(image)
    pixel_values = processor1(image, return_tensors="pt").pixel_values.to(device)

    # Generate text from the image
    generated_ids = model1.generate(pixel_values)
    generated_text = processor1.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return generated_text
```

### Usage Example

```python
# Load and predict
image_path = "path_to_your_image.jpg"
predicted_text = predict(image_path)
print("Predicted Text:", predicted_text)
```

## Training Hyperparameters

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    output_dir='/workspace/checkpoint-save/',
    save_total_limit=2,
    logging_steps=2,
    save_steps=1000,
    eval_steps=1000,
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="cer",
    greater_is_better=False,
    num_train_epochs=15
)
```
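
These arguments are intended for a `Seq2SeqTrainer`. A minimal sketch of the wiring is shown below; `train_dataset`, `eval_dataset`, and `compute_cer` are placeholders, since the card does not include the full training script.

```python
from transformers import Seq2SeqTrainer, default_data_collator

trainer = Seq2SeqTrainer(
    model=model1,                         # the VisionEncoderDecoderModel
    tokenizer=processor1.feature_extractor,
    args=training_args,
    train_dataset=train_dataset,          # placeholder: dataset yielding pixel_values and labels
    eval_dataset=eval_dataset,            # placeholder
    compute_metrics=compute_cer,          # placeholder: returns {"cer": ...}
    data_collator=default_data_collator,
)
trainer.train()
```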

## License

The model is shared under the MIT license. For details, see the [LICENSE](LICENSE) file.

## Acknowledgments

This model is built with the 🤗 Transformers library and uses a ViT encoder with a NepBERT decoder. Special thanks to the IIIT-HW dataset contributors.

---

Feel free to explore the project and contribute to the repository!