|
---
license: apache-2.0
widget:
- src: tiger.jpg
  prompt: "Describe this image in one sentence."
language:
- en
metrics:
- accuracy
base_model:
- nlpconnect/vit-gpt2-image-captioning
tags:
- gpt2
- image_to_text
- COCO
- image-captioning
pipeline_tag: image-to-text
---
|
|
|
|
|
|
|
# vit-gpt2-image-captioning_COCO_FineTuned |
|
This repository contains the fine-tuned ViT-GPT2 model for image captioning, trained on the COCO dataset. The model combines a Vision Transformer (ViT) for image feature extraction and GPT-2 for text generation to create descriptive captions from images. |
|
|
|
# Model Overview |
|
- Model Type: Vision Transformer (ViT) + GPT-2
- Dataset: COCO (Common Objects in Context)
- Task: Image Captioning
|
This model generates captions for input images based on the objects and contexts identified within the images. It has been fine-tuned on the COCO dataset, which includes a wide variety of images with detailed annotations, making it suitable for diverse image captioning tasks. |
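For a quick start, the model can also be used through the `transformers` `pipeline` API. The sketch below assumes the repository (named in the Usage section further down) hosts the tokenizer and image-processor files alongside the weights; otherwise, load them manually as shown in the Usage section. `"example.jpg"` is a placeholder path.

```python
from transformers import pipeline

# Quick-start sketch: the image-to-text pipeline wraps preprocessing,
# generation, and decoding in a single call.
captioner = pipeline(
    "image-to-text",
    model="ashok2216/vit-gpt2-image-captioning_COCO_FineTuned",
)

# "example.jpg" is a placeholder; pass any local image path or URL.
result = captioner("example.jpg")
print(result[0]["generated_text"])
```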
|
|
|
# Model Details |
|
The model architecture consists of two main components: |
|
|
|
- Vision Transformer (ViT): A powerful image encoder that extracts feature maps from input images.
- GPT-2: A language model that generates human-like text, fine-tuned to generate captions based on the extracted image features.
|
The model has been trained to:

- Recognize objects and scenes from images.
- Generate grammatically correct and contextually accurate captions.
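As a quick sanity check of this encoder-decoder pairing, the two sub-configurations can be inspected on the loaded checkpoint. This is a minimal sketch using the repository name from the Usage section below:

```python
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained(
    "ashok2216/vit-gpt2-image-captioning_COCO_FineTuned"
)

# The encoder sub-config describes the ViT image encoder and the
# decoder sub-config describes the GPT-2 text decoder.
print(model.config.encoder.model_type)  # expected: "vit"
print(model.config.decoder.model_type)  # expected: "gpt2"
```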
|
# Usage

You can use this model for image captioning with the Hugging Face transformers library. The code below loads the model and generates a caption for an input image.
|
|
|
# Installation |
|
|
|
To use this model, first install the required libraries:

```bash
pip install torch torchvision transformers
```

Then import the required modules:

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2Tokenizer
import torch
from PIL import Image
```
|
# Load the fine-tuned model and tokenizer |
|
```python |
|
# Load the fine-tuned encoder-decoder model, its image processor, and the GPT-2 tokenizer.
model = VisionEncoderDecoderModel.from_pretrained("ashok2216/vit-gpt2-image-captioning_COCO_FineTuned")
processor = ViTImageProcessor.from_pretrained("ashok2216/vit-gpt2-image-captioning_COCO_FineTuned")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
|
``` |
|
# Preprocess the image |
|
```python |
|
# Open the image and convert it to RGB, since the processor expects 3-channel input.
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
|
``` |
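The processor also accepts a list of images, so several files can be prepared as a single batch. This is a sketch that reuses the `processor` loaded above; the file names are placeholders.

```python
from PIL import Image

# Placeholder file names; replace with your own images.
paths = ["image1.jpg", "image2.jpg"]
images = [Image.open(p).convert("RGB") for p in paths]

# All images are resized to 224x224 and stacked into one batched tensor
# by the ViTImageProcessor loaded in the previous step.
batch = processor(images=images, return_tensors="pt")
print(batch.pixel_values.shape)  # torch.Size([2, 3, 224, 224])
```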
|
# Generate caption |
|
```python |
|
# Greedy decoding over the image features, then convert token ids back to text.
pixel_values = inputs.pixel_values
output = model.generate(pixel_values)
caption = tokenizer.decode(output[0], skip_special_tokens=True)

print("Generated Caption:", caption)
|
``` |
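`model.generate` accepts the usual decoding arguments. The values below are illustrative defaults, not the settings used to produce the example caption:

```python
# Beam search with a length cap often yields more fluent captions than
# greedy decoding; these values are illustrative, not tuned for this checkpoint.
output = model.generate(
    pixel_values,
    max_length=32,
    num_beams=4,
    early_stopping=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```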
|
# Example Output

Generated Caption:

"A group of people walking down the street with umbrellas in their hands."
|
|
|
# Fine-Tuning Details |
|
- Dataset: COCO (Common Objects in Context)
- Image Size: 224x224 pixels
- Training Time: ~12 hours on a GPU (depending on batch size and hardware)
- Fine-Tuning Strategy: the ViT-GPT2 model was fine-tuned for 5 epochs on the COCO training split, as outlined in the sketch below.
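The exact training script is not included in this repository. The snippet below is only a minimal sketch of how such a run can be set up with `Seq2SeqTrainer`; the dataset preparation, the column names (`image`, `caption`), and all hyperparameters other than the epoch count are illustrative assumptions.

```python
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    GPT2TokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
)

# Start from the base captioning checkpoint and fine-tune it on COCO-style pairs.
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer.pad_token = tokenizer.eos_token

def preprocess(example):
    # `example["image"]` is a PIL image and `example["caption"]` its reference text;
    # the column names are assumptions about how the COCO split is loaded.
    pixel_values = processor(images=example["image"], return_tensors="pt").pixel_values[0]
    labels = tokenizer(
        example["caption"], padding="max_length", max_length=64, truncation=True
    ).input_ids
    # Mask padding positions so they are ignored by the loss.
    labels = [tok if tok != tokenizer.pad_token_id else -100 for tok in labels]
    return {"pixel_values": pixel_values, "labels": labels}

# `train_dataset` is assumed to be a datasets.Dataset of COCO image-caption pairs
# that has already been mapped with `preprocess`.
args = Seq2SeqTrainingArguments(
    output_dir="vit-gpt2-coco",
    per_device_train_batch_size=8,   # illustrative
    num_train_epochs=5,              # matches the 5 epochs reported above
    learning_rate=5e-5,              # illustrative
    fp16=True,                       # assumes a CUDA GPU
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=default_data_collator,
)
trainer.train()
```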
|
# Model Performance

The model performs well on standard image captioning benchmarks, but its output quality depends heavily on the diversity and quality of the input images. For more specialized domains, further fine-tuning is recommended.
|
|
|
# Limitations |
|
- The model might struggle to generate accurate captions for highly ambiguous or abstract images.
- It is trained primarily on the COCO dataset and performs best on images with contexts similar to the training data.

# License

This model is licensed under the Apache 2.0 License.
|
|
|
# Acknowledgments |
|
- COCO Dataset: The model was trained on the COCO dataset, which is widely used for image captioning tasks.
- Hugging Face: For providing the platform to share models and facilitating easy usage of transformer-based models.

# Contact

For any questions, please contact Ashok Kumar.