---
license: apache-2.0
language:
- en
tags:
- visual_bert
- vqa
- easy_vqa
---
|
# Visual BERT finetuned on easy_vqa |
|
This model is a fine-tuned version of VisualBERT on the easy_vqa dataset. The dataset is available at the following [GitHub repo](https://github.com/vzhou842/easy-VQA/tree/master/easy_vqa).
|
|
|
## VisualBERT |
|
VisualBERT is a multi-modal vision-and-language model. It can be used for tasks such as visual question answering, multiple choice, and visual reasoning.

For more info on VisualBERT, please refer to the [documentation](https://huggingface.co./docs/transformers/model_doc/visual_bert#overview).
|
|
|
## Dataset |
|
The easy_vqa dataset, on which the model was fine-tuned, can be installed via the `easy_vqa` pip package:

```shell
pip install easy_vqa
```
|
|
|
An instance of the dataset consists of a question, the answer to the question (a label), and the id of the image the question refers to.
Each image is 64x64 pixels and contains a single shape (rectangle, triangle, or circle) in a random position, filled with one color (blue, red, green, yellow, black, gray, brown, or teal).
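As a rough sketch, a single instance can be thought of as a record with these three fields (the field names below are illustrative, not the package's actual schema):

```python
# Illustrative structure of one easy_vqa instance.
# Field names are assumptions for illustration, not the package's actual schema.
instance = {
    "question": "What is the blue shape?",  # natural-language question
    "answer": "circle",                     # label from a small closed answer set
    "image_id": 0,                          # id of the 64x64 image the question refers to
}

print(instance["question"], "->", instance["answer"])
```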
|
|
|
The questions of the dataset ask about the shape (e.g. "What is the blue shape?"), the color of the shape (e.g. "What color is the triangle?"),
or the presence of a particular shape/color, in both affirmative and negative form (e.g. "Is there a red shape?").
Therefore, the possible answers to a question are: the three possible shapes, the eight possible colors, yes, and no.
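Since the answer space is closed, the task reduces to 13-way classification (3 shapes + 8 colors + yes/no). A minimal sketch of such a label vocabulary follows; the ordering is illustrative, and the fine-tuned model's actual id-to-label mapping may differ:

```python
# Build the closed answer vocabulary: 3 shapes + 8 colors + yes/no = 13 labels.
shapes = ["rectangle", "triangle", "circle"]
colors = ["blue", "red", "green", "yellow", "black", "gray", "brown", "teal"]
answers = shapes + colors + ["yes", "no"]

# Map each answer to a class id (ordering here is an assumption;
# the model's own id2label mapping may differ).
answer2id = {a: i for i, a in enumerate(answers)}

print(len(answer2id))  # 13 possible answers
```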
|
|
|
More information about the package functions for loading the images and the questions can be found in the dataset's [repo](https://github.com/vzhou842/easy-VQA/tree/master/easy_vqa),
along with a utility script for generating new instances of the dataset in case data augmentation is needed.
|
|
|
## How to Use |
|
Load the image processor and the model with the following code:

```python
from transformers import ViltProcessor, VisualBertForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = VisualBertForQuestionAnswering.from_pretrained("daki97/visualbert_finetuned_easy_vqa")
```
|
|
|
## Colab Demo

An example of using the model with the easy_vqa dataset is available [here](https://colab.research.google.com/drive/1yQfmz6wiSasRl6z-DmP-X403r3lZFqQS#scrollTo=HeVnH8BKkYCI).