VisualBERT fine-tuned on easy_vqa

This model is a version of VisualBERT fine-tuned on the easy_vqa dataset. The dataset is available in its GitHub repo.

VisualBERT

VisualBERT is a multi-modal vision and language model. It can be used for tasks such as visual question answering, multiple choice, and visual reasoning. For more information on VisualBERT, please refer to the documentation.

Dataset

The easy_vqa dataset, on which the model was fine-tuned, can be installed via the easy_vqa package:

pip install easy_vqa

An instance of the dataset is composed of a question, the answer to the question (a label), and the id of the image related to the question. Each image is 64x64 pixels and contains a shape (rectangle, triangle or circle) filled with a single color (blue, red, green, yellow, black, gray, brown or teal) in a random position.

The questions in the dataset ask about the shape (e.g. What is the blue shape?), the color of the shape (e.g. What color is the triangle?), and the presence of a particular shape or color, in both affirmative and negative form (e.g. Is there a red shape?). The possible answers are therefore the three shapes, the eight colors, yes, and no, for 13 answers in total.
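The instance structure and answer space described above can be sketched in plain Python. This is only an illustration: the `VqaInstance` class, the integer type of `image_id`, and the ordering of the labels are assumptions made for this example, and the label ids used by the fine-tuned model may be ordered differently.

```python
from dataclasses import dataclass

# Shapes and colors as listed in the dataset description.
SHAPES = ["rectangle", "triangle", "circle"]
COLORS = ["blue", "red", "green", "yellow", "black", "gray", "brown", "teal"]

# Assumption: this ordering is illustrative only; the id assigned to each
# answer by the fine-tuned model may differ.
ANSWERS = SHAPES + COLORS + ["yes", "no"]
label2id = {answer: idx for idx, answer in enumerate(ANSWERS)}
id2label = {idx: answer for answer, idx in label2id.items()}


@dataclass
class VqaInstance:
    """Hypothetical container for one dataset instance."""

    question: str
    answer: str
    image_id: int  # assumed to be an integer id

    def label(self) -> int:
        """Return the class index of the answer under the mapping above."""
        return label2id[self.answer]


example = VqaInstance("What color is the triangle?", "red", 0)
print(len(ANSWERS))                 # size of the answer space
print(id2label[example.label()])    # round-trips back to the answer string
```

Treating each of the 13 answers as a class turns the task into single-label classification, which matches the description of the answer as a label.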

More information about the package functions for loading the images and the questions can be found in the dataset's repo, along with a utility script for generating new instances of the dataset in case data augmentation is needed.

How to Use

Load the image processor and the model with the following code:

from transformers import ViltProcessor, VisualBertForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

model = VisualBertForQuestionAnswering.from_pretrained("daki97/visualbert_finetuned_easy_vqa")

COLAB Demo

An example of using the model with the easy_vqa dataset is available here.