# CLIP-Vision-BERT Multilingual VQA Model

Fine-tuned CLIP-Vision-BERT on translated [VQAv2](https://visualqa.org/challenge.html) image-text pairs using a sequence-classification objective. We translate the dataset into three languages other than English, namely French, German, and Spanish, using the [MarianMT models](https://huggingface.co/transformers/model_doc/marian.html). This model is based on VisualBERT, which was introduced in [this paper](https://arxiv.org/abs/1908.03557) and first released in [this repository](https://github.com/uclanlp/visualbert). The output is 3,129 class logits, over the same answer classes used by the VisualBERT authors.

The initial weights are loaded from the Conceptual-12M 60k [checkpoint](https://huggingface.co/flax-community/clip-vision-bert-cc12m-60k).

We trained the CLIP-Vision-BERT VQA model during the community week hosted by Hugging Face 🤗 using JAX/Flax.

## Model description
CLIP-Vision-BERT is a modified BERT model that takes in visual embeddings from the CLIP-Vision transformer and concatenates them with BERT textual embeddings before passing them to the self-attention layers of BERT. This allows deep cross-modal interaction between the two modalities.
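
The fusion step can be pictured with a small NumPy sketch. This is only an illustration of the idea, not the model's actual code; the hidden size (768), the number of visual tokens (50), and the concatenation order are assumptions.

```python
import numpy as np

hidden_size = 768        # assumed shared hidden size after projecting CLIP features
num_visual_tokens = 50   # assumed number of CLIP-ViT tokens (49 patches + CLS)
text_len = 16            # example question length in tokens

# Visual embeddings from the CLIP-Vision transformer, projected to BERT's hidden size.
visual_embeds = np.random.randn(1, num_visual_tokens, hidden_size)

# Textual embeddings produced by BERT's embedding layer for the tokenized question.
text_embeds = np.random.randn(1, text_len, hidden_size)

# Concatenate along the sequence dimension before BERT's self-attention layers,
# so every text token can attend to every visual token and vice versa.
fused_embeds = np.concatenate([text_embeds, visual_embeds], axis=1)
print(fused_embeds.shape)  # (1, text_len + num_visual_tokens, hidden_size)
```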

## Intended uses & limitations❗️
This model is fine-tuned on a multi-translated version of the visual question answering task, [VQA v2](https://visualqa.org/challenge.html). Since VQAv2 is a dataset scraped from the internet, it carries some biases, which will also affect all fine-tuned versions of this model.

### How to use❓
You can use this model directly for visual question answering. You will need to clone the model code from [here](https://github.com/gchhablani/multilingual-vqa). An example of usage is shown below:

```python
>>> from torchvision.io import read_image
>>> import numpy as np
>>> import os
>>> from transformers import CLIPProcessor, BertTokenizerFast
>>> from model.flax_clip_vision_bert.modeling_clip_vision_bert import FlaxCLIPVisionBertForSequenceClassification
>>> image_path = os.path.join('images/val2014', os.listdir('images/val2014')[0])
>>> img = read_image(image_path)
>>> clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.
>>> clip_outputs = clip_processor(images=img)
>>> clip_outputs['pixel_values'][0] = clip_outputs['pixel_values'][0].transpose(1, 2, 0)  # Transpose the image because the model expects channel-last images.
>>> tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
>>> model = FlaxCLIPVisionBertForSequenceClassification.from_pretrained('flax-community/clip-vision-bert-vqa-ft-6k')
>>> text = "Are there teddy bears in the image?"
>>> tokens = tokenizer([text], return_tensors="np")
>>> pixel_values = np.concatenate([clip_outputs['pixel_values']])
>>> outputs = model(pixel_values=pixel_values, **tokens)
>>> preds = outputs.logits[0]
>>> sorted_indices = np.argsort(preds)[::-1]  # Sort class indices by score, highest first.
>>> top_5_indices = sorted_indices[:5]
>>> top_5_tokens = list(map(model.config.id2label.get, top_5_indices))
>>> top_5_scores = preds[top_5_indices]
>>> print(dict(zip(top_5_tokens, top_5_scores)))
{'yes': 15.809224, 'no': 7.8785815, '<unk>': 4.622649, 'very': 4.511462, 'neither': 3.600822}
```

## Training data 🏋🏻‍♂️
The CLIP-Vision-BERT model was fine-tuned on the VQAv2 dataset translated into four languages with the MarianMT models: English, French, German, and Spanish. Hence, the dataset contains four times as many questions as the original English set.

The dataset questions and image URLs/paths can be downloaded from [flax-community/multilingual-vqa](https://huggingface.co/datasets/flax-community/multilingual-vqa).
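
If you want to inspect these files locally, one option is to download the dataset repository with `huggingface_hub`. This is a minimal sketch; it makes no assumptions about the file layout inside the repository and simply lists what was downloaded.

```python
import os

from huggingface_hub import snapshot_download

# Download the translated questions and image URLs/paths to a local cache directory.
local_dir = snapshot_download(
    repo_id="flax-community/multilingual-vqa",
    repo_type="dataset",
)

# Inspect what the dataset repository actually contains.
for name in sorted(os.listdir(local_dir)):
    print(name)
```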

## Data Cleaning 🧹

The original dataset contains 443,757 train and 214,354 validation image-question pairs; we use only the `multiple_choice_answer` field as the target. Answers that are not present in the 3,129 classes are mapped to the `<unk>` label.
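
A minimal sketch of this answer-to-label mapping is shown below. The tiny `label2id` dictionary here is a stand-in for the full 3,129-class map (the usage example above reads the inverse mapping from `model.config.id2label`).

```python
# Stub mapping standing in for the real 3,129-class answer vocabulary.
label2id = {"yes": 0, "no": 1, "2": 2, "<unk>": 3}

def answer_to_label(multiple_choice_answer: str) -> int:
    """Map an answer string to a class id, collapsing unknown answers to `<unk>`."""
    return label2id.get(multiple_choice_answer, label2id["<unk>"])

print(answer_to_label("yes"))        # 0
print(answer_to_label("a zebra!!"))  # 3, i.e. the <unk> class
```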

**Splits**
We use the original train-val splits from the VQAv2 dataset. After translation, we get 1,775,028 train image-text pairs and 857,416 validation image-text pairs.

## Training procedure 👨🏻‍💻
### Preprocessing
The texts are lowercased and tokenized using WordPiece with a shared vocabulary of approximately 110,000 tokens. The beginning of each text is marked with `[CLS]` and its end with `[SEP]`.
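
You can check this preprocessing with the same multilingual tokenizer used in the usage example above; the exact word pieces depend on the vocabulary.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")

# Lowercasing + WordPiece tokenization; special tokens are added automatically.
encoded = tokenizer("Are there teddy bears in the image?")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])

print(tokens)                # starts with '[CLS]' and ends with '[SEP]'
print(tokenizer.vocab_size)  # size of the shared multilingual WordPiece vocabulary
```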

### Fine-tuning
The model was fine-tuned on a Google Cloud Engine TPU v3-8 machine (335 GB of RAM, 1000 GB of hard drive, 96 CPU cores) with **8 v3 TPU cores** for 6k steps, using a per-device batch size of 128 and a maximum sequence length of 128. The optimizer used is AdamW with a learning rate of 5e-5, learning-rate warmup for 1,600 steps, and linear decay of the learning rate afterwards.
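
The learning-rate schedule described above (linear warmup for 1,600 steps followed by linear decay) can be written with `optax` as in the sketch below. This is not the project's training script; weight decay and the Adam betas are left at `optax` defaults because they are not stated here.

```python
import optax

total_steps = 6_000
warmup_steps = 1_600
peak_lr = 5e-5

# Linear warmup from 0 to the peak learning rate, then linear decay back to 0.
learning_rate = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

# AdamW driven by the schedule above.
optimizer = optax.adamw(learning_rate=learning_rate)
```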

We tracked experiments using TensorBoard. Here is a link to the main dashboard: [CLIP Vision BERT VQAv2 Fine-tuning Dashboard](https://huggingface.co/flax-community/multilingual-vqa-pt-60k-ft/tensorboard)

#### **Fine-tuning Results 📊**

The model at this checkpoint reached an **eval accuracy of 0.49** on our multilingual VQAv2 dataset.

## Team Members
- Gunjan Chhablani [@gchhablani](https://hf.co/gchhablani)
- Bhavitvya Malik [@bhavitvyamalik](https://hf.co/bhavitvyamalik)

## Acknowledgements
We thank [Nilakshan Kunananthaseelan](https://huggingface.co/knilakshan20) for helping us whenever he could get a chance. We also thank [Abheesht Sharma](https://huggingface.co/abheesht) for helping with the discussions in the initial phases. [Luke Melas](https://github.com/lukemelas) helped us get the CC-12M data onto our TPU-VMs, and we are very grateful to him.

This project would not have been possible without the help of [Patrick](https://huggingface.co/patrickvonplaten) and [Suraj](https://huggingface.co/valhalla), who met with us frequently, helped review our approach, and guided us throughout the project.

Huge thanks to the Hugging Face 🤗 and Google JAX/Flax teams for such a wonderful community week, for answering our queries on the Slack channel, and for providing us with the TPU-VMs.

<img src="https://pbs.twimg.com/media/E443fPjX0AY1BsR.jpg:large">