File size: 3,006 Bytes
362aa90 9715e7b b1568d6 2535abf 362aa90 28d2cd7 362aa90 28d2cd7 a27b459 362aa90 a972777 362aa90 d6e1557 a27b459 362aa90 a972777 362aa90 28d2cd7 a972777 362aa90 a27b459 362aa90 a972777 362aa90 d6e1557 074b040 a27b459 a972777 362aa90 a972777 362aa90 d6e1557 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 |
---
library_name: transformers
datasets:
- laicsiifes/flickr30k-pt-br
language:
- pt
metrics:
- bleu
- rouge
- meteor
- bertscore
base_model:
- microsoft/swin-base-patch4-window7-224
pipeline_tag: text-generation
---
# 🎉 Swin-DistilBERTimbau for Image Captioning
Swin-DistilBERTimbau model trained for image captioning on [Flickr30K Portuguese](https://huggingface.co./datasets/laicsiifes/flickr30k-pt-br) (translated version using Google Translator API)
at resolution 224x224 and max sequence length of 512 tokens.
## 🤖 Model Description
The Swin-DistilBERTimbau is a type of Vision Encoder Decoder which leverage the checkpoints of the [Swin Transformer](https://huggingface.co./microsoft/swin-base-patch4-window7-224)
as encoder and the checkpoints of the [DistilBERTimbau](https://huggingface.co./adalbertojunior/distilbert-portuguese-cased) as decoder.
The encoder checkpoints come from Swin Trasnformer version pre-trained on ImageNet-1k at resolution 224x224.
The code used for training and evaluation is available at: https://github.com/laicsiifes/ved-transformer-caption-ptbr. In this work, Swin-DistilBERTimbau
was trained together with its buddy [Swin-GPorTuguese](https://huggingface.co./laicsiifes/swin-gpt2-flickr30k-pt-br).
Other models evaluated didn't achieve performance as high as Swin-DistilBERTimbau and Swin-GPorTuguese, namely: DeiT-BERTimbau,
DeiT-DistilBERTimbau, DeiT-GPorTuguese, Swin-BERTimbau, ViT-BERTimbau, ViT-DistilBERTimbau and ViT-GPorTuguese.
## 🧑💻 How to Get Started with the Model
Use the code below to get started with the model.
```python
import requests
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel
# load a fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("laicsiifes/swin-distilbert-flickr30k-pt-br")
tokenizer = GPT2TokenizerFast.from_pretrained("laicsiifes/swin-distilbert-flickr30k-pt-br")
image_processor = ViTImageProcessor.from_pretrained("laicsiifes/swin-distilbert-flickr30k-pt-br")
# perform inference on an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor(image, return_tensors="pt").pixel_values
# generate caption
generated_ids = model.generate(pixel_values)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
## 📈 Results
The evaluation metrics Cider-D, BLEU@4, ROUGE-L, METEOR and BERTScore are abbreviated as C, B@4, RL, M and BS, respectively.
|Model|Training|Evaluation|C|B@4|RL|M|BS|
|-----|--------|----------|-------|------|-------|------|---------|
|Swin-DistilBERTimbau|Flickr30K Portuguese|Flickr30K Portuguese|66.73|24.65|39.98|44.71|72.30|
|Swin-GPorTuguese|Flickr30K Portuguese|Flickr30K Portuguese|64.71|23.15|39.39|44.36|71.70|
## 📋 BibTeX entry and citation info
```bibtex
Coming Soon
``` |