--- tags: - generated_from_trainer datasets: - coco metrics: - rouge - bleu model-index: - name: vit-swin-base-224-gpt2-image-captioning results: [] license: mit language: - en pipeline_tag: image-to-text --- # vit-swin-base-224-gpt2-image-captioning This model is a fine-tuned [VisionEncoderDecoder](https://huggingface.co./docs/transformers/model_doc/vision-encoder-decoder) model on 60% of the [COCO2014](https://huggingface.co./datasets/HuggingFaceM4/COCO) dataset. It achieves the following results on the testing set: - Loss: 0.7989 - Rouge1: 53.1153 - Rouge2: 24.2307 - Rougel: 51.5002 - Rougelsum: 51.4983 - Bleu: 17.7765 ## Model description The model was initialized on [microsoft/swin-base-patch4-window7-224-in22k](https://huggingface.co./microsoft/swin-base-patch4-window7-224-in22k) as the vision encoder, the [gpt2](https://huggingface.co./gpt2) as the decoder. ## Intended uses & limitations You can use this model for image captioning only. ## How to use You can either use the simple pipeline API: ```python from transformers import pipeline image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning") # infer the caption caption = image_captioner("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")[0]['generated_text'] print(f"caption: {caption}") ``` Or initialize everything for more flexibility: ```python from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor import torch import os import urllib.parse as parse from PIL import Image import requests # a function to determine whether a string is a URL or not def is_url(string): try: result = parse.urlparse(string) return all([result.scheme, result.netloc, result.path]) except: return False # a function to load an image def load_image(image_path): if is_url(image_path): return Image.open(requests.get(image_path, stream=True).raw) elif os.path.exists(image_path): return Image.open(image_path) # a function to perform inference def get_caption(model, image_processor, tokenizer, image_path): image = load_image(image_path) # preprocess the image img = image_processor(image, return_tensors="pt").to(device) # generate the caption (using greedy decoding by default) output = model.generate(**img) # decode the output caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0] return caption device = "cuda" if torch.cuda.is_available() else "cpu" # load the fine-tuned image captioning model and corresponding tokenizer and image processor model = VisionEncoderDecoderModel.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning").to(device) tokenizer = GPT2TokenizerFast.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning") image_processor = ViTImageProcessor.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning") # target image url = "http://images.cocodataset.org/test-stuff2017/000000000019.jpg" # get the caption caption = get_caption(model, image_processor, tokenizer, url) print(f"caption: {caption}") ``` Output: ``` Two cows laying in a field with a sky background. ``` ## Training procedure You can check [this guide](https://www.thepythoncode.com/article/image-captioning-with-pytorch-and-transformers-in-python) to learn how this model was fine-tuned. ### Training hyperparameters The following hyperparameters were used during training: - learning_rate: 5e-05 - train_batch_size: 64 - eval_batch_size: 64 - seed: 42 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 - lr_scheduler_type: linear - num_epochs: 2 ### Training results | Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Bleu | Gen Len | |:-------------:|:-----:|:-----:|:---------------:|:-------:|:-------:|:-------:|:---------:|:-------:|:-------:| | 1.0018 | 0.38 | 2000 | 0.8860 | 38.6537 | 13.8145 | 35.3932 | 35.3935 | 8.2448 | 11.2946 | | 0.8827 | 0.75 | 4000 | 0.8395 | 40.0458 | 14.8829 | 36.5321 | 36.5366 | 9.1169 | 11.2946 | | 0.8378 | 1.13 | 6000 | 0.8140 | 41.2736 | 15.9576 | 37.5504 | 37.5512 | 9.871 | 11.2946 | | 0.7913 | 1.51 | 8000 | 0.8012 | 41.6642 | 16.1987 | 37.8786 | 37.8891 | 10.0786 | 11.2946 | | 0.7794 | 1.89 | 10000 | 0.7933 | 41.9119 | 16.3738 | 38.1062 | 38.1292 | 10.288 | 11.2946 | Total training time: ~5 hours on NVIDIA A100 GPU. ### Framework versions - Transformers 4.26.0 - Pytorch 1.13.1+cu116 - Datasets 2.9.0 - Tokenizers 0.13.2