---
library_name: transformers
datasets:
- laicsiifes/flickr30k-pt-br
language:
- pt
metrics:
- bleu
- rouge
- meteor
- bertscore
base_model: laicsiifes/swin-distilbertimbau
pipeline_tag: image-to-text
model-index:
  - name: Swin-DistilBERTimbau
    results:
      - task:
          name: Image Captioning
          type: image-to-text
        dataset:
          name: laicsiifes/flickr30k-pt-br
          type: flickr30k-pt-br
          split: test
        metrics:
        - name: Cider-D
          type: cider
          value: 66.73
        - name: BLEU@4
          type: bleu
          value: 24.65
        - name: ROUGE-L
          type: rouge
          value: 39.98
        - name: METEOR
          type: meteor
          value: 44.71
        - name: BERTScore
          type: bertscore
          value: 72.30
---

# 🎉 Swin-DistilBERTimbau for Brazilian Portuguese Image Captioning

Swin-DistilBERTimbau is a model trained for image captioning on [Flickr30K Portuguese](https://huggingface.co./datasets/laicsiifes/flickr30k-pt-br) (a version of Flickr30K translated with the Google Translator API),
at an image resolution of 224x224 and a maximum sequence length of 512 tokens.


## 🤖 Model Description

Swin-DistilBERTimbau is a Vision Encoder Decoder model that uses the [Swin Transformer](https://huggingface.co./microsoft/swin-base-patch4-window7-224)
checkpoint as its encoder and the [DistilBERTimbau](https://huggingface.co./adalbertojunior/distilbert-portuguese-cased) checkpoint as its decoder.
The encoder checkpoint comes from the Swin Transformer version pre-trained on ImageNet-1k at resolution 224x224.

The code used for training and evaluation is available at: https://github.com/laicsiifes/ved-transformer-caption-ptbr. In this work, Swin-DistilBERTimbau
was trained alongside its companion model, [Swin-GPorTuguese-2](https://huggingface.co./laicsiifes/swin-gpt2-flickr30k-pt-br).

The other models evaluated (DeiT-BERTimbau, DeiT-DistilBERTimbau, DeiT-GPorTuguese-2, Swin-BERTimbau, ViT-BERTimbau, ViT-DistilBERTimbau and ViT-GPorTuguese-2)
did not perform as well as Swin-DistilBERTimbau and Swin-GPorTuguese-2.

## 🧑‍💻 How to Get Started with the Model

Use the code below to get started with the model.

```python
import requests
from PIL import Image

from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel

# load a fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("laicsiifes/swin-distilbertimbau")
tokenizer = AutoTokenizer.from_pretrained("laicsiifes/swin-distilbertimbau")
image_processor = AutoImageProcessor.from_pretrained("laicsiifes/swin-distilbertimbau")

# perform inference on an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor(image, return_tensors="pt").pixel_values

# generate caption
generated_ids = model.generate(pixel_values)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
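To caption several images in one call, or to pass generation settings explicitly, the snippet above can be wrapped in a small helper. This is a minimal sketch: `caption_images` is a hypothetical name, not part of `transformers`, and it assumes the `model`, `tokenizer` and `image_processor` objects have already been loaded as shown above.

```python
from typing import Any, List, Sequence


def caption_images(
    images: Sequence[Any],
    model: Any,
    tokenizer: Any,
    image_processor: Any,
    **generation_kwargs: Any,
) -> List[str]:
    """Caption a batch of PIL images with an already-loaded model.

    Hypothetical helper for illustration; not part of transformers.
    """
    # Convert to RGB so grayscale or RGBA inputs match the 3-channel encoder.
    rgb_images = [image.convert("RGB") for image in images]

    # The image processor resizes and normalizes the whole batch at once.
    pixel_values = image_processor(rgb_images, return_tensors="pt").pixel_values

    # Extra keyword arguments (e.g. max_length, num_beams) go to generate().
    generated_ids = model.generate(pixel_values, **generation_kwargs)

    # Decode every sequence in the batch and trim surrounding whitespace.
    captions = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return [caption.strip() for caption in captions]
```

For example, `caption_images([image], model, tokenizer, image_processor, max_length=25, num_beams=3)` returns a list with one caption. The values `25` and `3` are illustrative only and are not the settings used in the reported evaluation.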

## 📈 Results

The evaluation metrics CIDEr-D, BLEU@4, ROUGE-L, METEOR and BERTScore
(computed with [BERTimbau](https://huggingface.co./neuralmind/bert-base-portuguese-cased)) are abbreviated as C, B@4, RL, M and BS, respectively.

|Model|Training|Evaluation|C|B@4|RL|M|BS|
|:---:|:------:|:--------:|:-----:|:----:|:-----:|:----:|:-------:|
|Swin-DistilBERTimbau|Flickr30K Portuguese|Flickr30K Portuguese|66.73|24.65|39.98|44.71|72.30|
|Swin-GPorTuguese-2|Flickr30K Portuguese|Flickr30K Portuguese|64.71|23.15|39.39|44.36|71.70|
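The gap between the two models on each metric can be read off the table with a few lines of arithmetic. The sketch below only restates the scores above; the `scores` dictionary and the `metric_gaps` helper are illustrative names, not part of the evaluation code.

```python
# Scores copied from the results table (Flickr30K Portuguese).
scores = {
    "Swin-DistilBERTimbau": {"C": 66.73, "B@4": 24.65, "RL": 39.98, "M": 44.71, "BS": 72.30},
    "Swin-GPorTuguese-2": {"C": 64.71, "B@4": 23.15, "RL": 39.39, "M": 44.36, "BS": 71.70},
}


def metric_gaps(model_a: str, model_b: str) -> dict:
    """Return score(model_a) - score(model_b) per metric, rounded to 2 decimals."""
    return {
        metric: round(scores[model_a][metric] - scores[model_b][metric], 2)
        for metric in scores[model_a]
    }


gaps = metric_gaps("Swin-DistilBERTimbau", "Swin-GPorTuguese-2")
for metric, gap in gaps.items():
    print(f"{metric}: {gap:+.2f}")
```

Swin-DistilBERTimbau scores higher on all five metrics, with the largest gap on CIDEr-D (2.02 points) and the smallest on METEOR (0.35 points).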

## 📋 BibTeX entry and citation info

```bibtex
@inproceedings{bromonschenkel2024comparative,
                title = "A Comparative Evaluation of Transformer-Based Vision 
                         Encoder-Decoder Models for Brazilian Portuguese Image Captioning",
               author = "Bromonschenkel, Gabriel and Oliveira, Hil{\'a}rio and 
                         Paix{\~a}o, Thiago M.",
            booktitle = "Proceedings...",
         organization = "Conference on Graphics, Patterns and Images, 37. (SIBGRAPI)",
                 year = "2024"
}
```