# CLIP-Vision-BERT Multilingual VQA Model

Fine-tuned CLIP-Vision-BERT on translated [VQAv2](https://visualqa.org/challenge.html) image-text pairs using a sequence classification objective. We translate the dataset into three languages besides English, namely French, German, and Spanish, using the [MarianMT models](https://huggingface.co/transformers/model_doc/marian.html). This model is based on VisualBERT, which was introduced in [this paper](https://arxiv.org/abs/1908.03557) and first released in [this repository](https://github.com/uclanlp/visualbert). The output is 3129 class logits over the same answer classes used by the VisualBERT authors.

The initial weights are loaded from the Conceptual-12M 60k [checkpoint](https://huggingface.co/flax-community/clip-vision-bert-cc12m-60k).

We trained the CLIP-Vision-BERT VQA model during the community week hosted by Hugging Face 🤗 using JAX/Flax.

## Model description
CLIP-Vision-BERT is a modified BERT model that takes visual embeddings from the CLIP-Vision transformer and concatenates them with BERT's textual embeddings before passing them through BERT's self-attention layers. This is done to enable deep cross-modal interaction between the two modalities.
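
For intuition, here is a minimal sketch of the fusion step described above. This is not the actual model code; the names `text_embeds` and `visual_embeds` and the shapes (BERT-base hidden size, CLIP ViT-B/32 patch count) are illustrative assumptions.

```python
import numpy as np

batch_size, text_len, visual_len, hidden = 1, 128, 50, 768
text_embeds = np.zeros((batch_size, text_len, hidden))      # BERT token + position + type embeddings (assumed shapes)
visual_embeds = np.zeros((batch_size, visual_len, hidden))  # projected CLIP-Vision patch embeddings (assumed shapes)

# The visual sequence is concatenated with the textual sequence, and the joint
# sequence is processed by BERT's self-attention layers, so every text token
# can attend to every visual token and vice versa.
fused = np.concatenate([text_embeds, visual_embeds], axis=1)
print(fused.shape)  # (1, 178, 768)
```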
## Intended uses & limitations❗️
This model is fine-tuned on a multilingually translated version of the visual question answering task [VQA v2](https://visualqa.org/challenge.html). Since VQAv2 is a dataset scraped from the internet, it carries some biases, which will also affect all fine-tuned versions of this model.

### How to use❓
You can use this model directly for visual question answering. You will need to clone the modeling code from [this repository](https://github.com/gchhablani/multilingual-vqa). An example of usage is shown below:

```python
>>> from torchvision.io import read_image
>>> import numpy as np
>>> import os
>>> from transformers import CLIPProcessor, BertTokenizerFast
>>> from model.flax_clip_vision_bert.modeling_clip_vision_bert import FlaxCLIPVisionBertForSequenceClassification
>>> image_path = os.path.join('images/val2014', os.listdir('images/val2014')[0])
>>> img = read_image(image_path)
>>> clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
>>> clip_outputs = clip_processor(images=img)
>>> clip_outputs['pixel_values'][0] = clip_outputs['pixel_values'][0].transpose(1,2,0) # Transpose the image: the model expects channel-last inputs.
>>> tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
>>> model = FlaxCLIPVisionBertForSequenceClassification.from_pretrained('flax-community/clip-vision-bert-vqa-ft-6k')
>>> text = "Are there teddy bears in the image?"
>>> tokens = tokenizer([text], return_tensors="np")
>>> pixel_values = np.concatenate([clip_outputs['pixel_values']])
>>> outputs = model(pixel_values=pixel_values, **tokens)
>>> preds = outputs.logits[0]
>>> sorted_indices = np.argsort(preds)[::-1] # Indices sorted by decreasing score
>>> top_5_indices = sorted_indices[:5]
>>> top_5_tokens = list(map(model.config.id2label.get, top_5_indices))
>>> top_5_scores = preds[top_5_indices]
>>> print(dict(zip(top_5_tokens, top_5_scores)))
{'yes': 15.809224, 'no': 7.8785815, '<unk>': 4.622649, 'very': 4.511462, 'neither': 3.600822}
```
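
Because the fine-tuning data also contains French, German, and Spanish questions, you can query the same model in those languages. Below is a small, illustrative continuation of the session above; the French sentence is simply a translation of the English question and is not taken from the dataset.

```python
>>> text_fr = "Y a-t-il des ours en peluche dans l'image ?"
>>> tokens_fr = tokenizer([text_fr], return_tensors="np")
>>> outputs_fr = model(pixel_values=pixel_values, **tokens_fr)
>>> print(model.config.id2label[int(np.argmax(outputs_fr.logits[0]))])
```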
## Training data 🏋🏻♂️
The CLIP-Vision-BERT model was fine-tuned on the VQAv2 dataset in four languages: English, plus French, German, and Spanish translations produced with the Marian models. Hence, the dataset contains four times the original English questions.

The dataset questions and image URLs/paths can be downloaded from [flax-community/multilingual-vqa](https://huggingface.co/datasets/flax-community/multilingual-vqa).
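
One way to fetch the translated question files locally is with the `huggingface_hub` client, as sketched below; the exact file layout inside the dataset repository is not described here, so inspect the downloaded folder to locate the train/validation files.

```python
from huggingface_hub import snapshot_download

# Download a local copy of the dataset repository (questions and image paths/URLs).
local_dir = snapshot_download(repo_id="flax-community/multilingual-vqa", repo_type="dataset")
print(local_dir)  # Inspect this folder to find the train/validation question files.
```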
## Data Cleaning 🧹
The original dataset contains 443,757 train and 214,354 validation image-question pairs. We use only the `multiple_choice_answer` field as the target answer. Answers that are not present in the 3129 classes are mapped to the `<unk>` label.
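
A minimal sketch of this answer-to-label mapping is shown below; `answer_to_label` is a hypothetical dictionary built from the 3129 answer classes and is not part of the released code.

```python
def encode_answer(multiple_choice_answer: str, answer_to_label: dict) -> int:
    """Map a VQAv2 `multiple_choice_answer` to one of the 3129 class ids."""
    # Answers outside the 3129-class vocabulary fall back to the '<unk>' label.
    return answer_to_label.get(multiple_choice_answer, answer_to_label['<unk>'])
```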
**Splits**
We use the original train-val splits from the VQAv2 dataset. After translation, we get 1,775,028 train image-text pairs and 857,416 validation image-text pairs.

## Training procedure 👨🏻💻
### Preprocessing
The texts are lowercased and tokenized using WordPiece with a shared vocabulary of approximately 110,000 tokens. The beginning of each input is marked with `[CLS]` and its end with `[SEP]`.
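
For illustration, the snippet below shows what this tokenization looks like with the `bert-base-multilingual-uncased` tokenizer used in the usage example above; the printed outputs are omitted here.

```python
>>> from transformers import BertTokenizerFast
>>> tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
>>> tokenizer.vocab_size  # shared WordPiece vocabulary of roughly 110k tokens
>>> encoded = tokenizer("Are there teddy bears in the image?")
>>> tokenizer.convert_ids_to_tokens(encoded['input_ids'])  # lowercased pieces wrapped in [CLS] ... [SEP]
```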
### Fine-tuning
This checkpoint was trained on a Google Cloud Engine TPU v3-8 machine (335 GB of RAM, 1000 GB of hard drive, 96 CPU cores) with **8 v3 TPU cores** for 6k steps, using a per-device batch size of 128 and a maximum sequence length of 128. The optimizer used is AdamW with a learning rate of 5e-5, learning-rate warmup for 1600 steps, and linear decay of the learning rate afterwards.
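
The optimizer and schedule described above can be written down with `optax` roughly as follows. This is a sketch based on the hyperparameters listed in this section, not the exact training script; in particular, decaying all the way to zero by the end of training is an assumption.

```python
import optax

total_steps = 6_000
warmup_steps = 1_600
peak_lr = 5e-5

# Linear warmup to the peak learning rate, followed by linear decay.
warmup_fn = optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps)
decay_fn = optax.linear_schedule(init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps)
lr_schedule = optax.join_schedules(schedules=[warmup_fn, decay_fn], boundaries=[warmup_steps])

optimizer = optax.adamw(learning_rate=lr_schedule)
```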
We tracked experiments using TensorBoard. Here is a link to the main dashboard: [CLIP Vision BERT VQAv2 Fine-tuning Dashboard](https://huggingface.co/flax-community/multilingual-vqa-pt-60k-ft/tensorboard).

#### **Fine-tuning Results 📊**
The model at this checkpoint reached an **eval accuracy of 0.49** on our multilingual VQAv2 dataset.

## Team Members
- Gunjan Chhablani [@gchhablani](https://hf.co/gchhablani)
- Bhavitvya Malik [@bhavitvyamalik](https://hf.co/bhavitvyamalik)

## Acknowledgements
We thank [Nilakshan Kunananthaseelan](https://huggingface.co/knilakshan20) for helping us whenever he could get a chance. We also thank [Abheesht Sharma](https://huggingface.co/abheesht) for helping in the discussions during the initial phases. [Luke Melas](https://github.com/lukemelas) helped us get the CC-12M data onto our TPU-VMs, and we are very grateful to him.

This project would not have been possible without the help of [Patrick](https://huggingface.co/patrickvonplaten) and [Suraj](https://huggingface.co/valhalla), who met with us frequently, reviewed our approach, and guided us throughout the project.

Huge thanks to the Hugging Face 🤗 and Google JAX/Flax teams for such a wonderful community week, for answering our queries on the Slack channel, and for providing us with the TPU-VMs.

<img src="https://pbs.twimg.com/media/E443fPjX0AY1BsR.jpg:large">