# Italian CLIP

CLIP ([Radford et al., 2021](https://arxiv.org/abs/2103.00020)) is an amazing model that can learn to represent images and text jointly in the same space.

In this project, we aim to propose the first CLIP model trained on Italian data; in this context, Italian can be considered a low-resource language. Using a few smart techniques, we have been able to fine-tune a SOTA Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model is built upon the pre-trained [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI [vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).

In building this project we kept in mind the following principles:

+ **Novel Contributions**: We created a dataset of ~1.4 million Italian image-text pairs (**that we will share with the community**) and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence;
+ **Scientific Validity**: Claims are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models on two tasks and made the validation reproducible for everybody;
+ **Broader Outlook**: We always kept in mind the possible usages and limitations of this model.

We put our **hearts** and **souls** into the project during this week! Not only did we work on a cool project, but we were also able to make new friends and learn a lot from each other while working towards a common goal!

+ *Image to Text*: This task is essentially a zero-shot image classification task. The user provides an image and a set of captions/labels, and CLIP computes the similarity between the image and each label. The webapp then displays a probability distribution over the captions (a minimal sketch of this step is shown right after this list).

+ *Examples & Applications*: This page showcases some interesting results we got from the model; we believe that there are different applications that can start from here.
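
As mentioned in the *Image to Text* bullet above, here is a toy sketch of that final step: turning image-caption similarities into the probability distribution shown by the webapp. The embeddings, the temperature value, and the function name are made up for illustration; this is not the demo code.

```python
import numpy as np

def caption_probabilities(image_embed, caption_embeds, temperature=20.0):
    """Softmax over caption similarities for a single image (illustrative temperature)."""
    # embeddings are assumed to be L2-normalized, so the dot product is a cosine similarity
    logits = temperature * caption_embeds @ image_embed
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# toy usage with random 512-d vectors standing in for CLIP embeddings
rng = np.random.default_rng(0)
image_embed = rng.normal(size=512)
image_embed /= np.linalg.norm(image_embed)
caption_embeds = rng.normal(size=(3, 512))
caption_embeds /= np.linalg.norm(caption_embeds, axis=1, keepdims=True)
print(caption_probabilities(image_embed, caption_embeds))  # probabilities over the 3 captions
```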

# Novel Contributions

We considered four main sources of data:

+ [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions described in the paper, as they are the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994), and this kind of text, without more information, is not useful for learning a good mapping between images and captions. On the other hand, this text is written in Italian and it is of good quality. We cannot simply remove short captions, as some of them are still good (e.g., "running dog"). Thus, to prevent polluting the data with captions that are not meaningful, we used *POS tagging* (a sketch of this kind of filter is shown right after this list). Captions like *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.

+ [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than 100K images, and for each image more than one caption is available.
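
To make the POS-tagging filter mentioned in the WIT bullet above concrete, here is a minimal sketch of that kind of rule, assuming spaCy's `it_core_news_sm` Italian model. It is not the exact set of rules we applied, just the general idea of keeping captions that contain at least one content word.

```python
import spacy

# assumes the Italian model is installed: python -m spacy download it_core_news_sm
nlp = spacy.load("it_core_news_sm")

CONTENT_POS = {"NOUN", "VERB", "ADJ"}  # common nouns, verbs, adjectives

def is_meaningful(caption: str) -> bool:
    """Keep captions with at least one content word; drop name-only captions."""
    doc = nlp(caption)
    return any(token.pos_ in CONTENT_POS for token in doc)

captions = [
    "Anna Maria Mozzoni",               # name only
    "Dora Riparia",                     # name only
    "un cane che corre sulla spiaggia", # "a dog running on the beach"
]
kept = [c for c in captions if is_meaningful(c)]
print(kept)  # the descriptive caption is kept, name-only captions are (typically) dropped
```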

The fixed logit scale is used after the computation of the similarity between the images and the texts in CLIP (see the code [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/modeling_hybrid_clip.py#L64)). We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence embeddings.
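
For reference, here is a small NumPy sketch (not our Flax training code) of where a fixed logit scale enters a CLIP-style contrastive loss: the image-text similarity matrix is multiplied by a constant scale, rather than a learned temperature, before the usual symmetric cross-entropy. The value 20.0 below is only an illustrative constant.

```python
import numpy as np

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale=20.0):
    """Symmetric contrastive loss with a fixed (non-learned) logit scale; illustrative only."""
    # L2-normalize so that the dot products are cosine similarities
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)

    # the fixed scale is applied right after the similarity computation
    logits = logit_scale * image_embeds @ text_embeds.T
    targets = np.arange(logits.shape[0])  # the i-th image matches the i-th text

    def cross_entropy(l):
        l = l - l.max(axis=-1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return -log_probs[np.arange(l.shape[0]), targets].mean()

    # average of the image-to-text and text-to-image losses
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```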

### Effect of Our Edits

The following picture showcases the effect that these edits have had on our evaluation loss:

<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="95%"/>

The purple line is the original training without any of our improvements: you can see how many steps we needed to get the loss down. The yellow line is the loss with the new optimizer: it is **striking** to see how much time we save thanks to this addition! Not only does the loss improve, it also converges much faster! The blue line shows the results when fixed scaling is added on top of the new optimizer. Finally, we added the backbone freezing, and you can see the results in the light blue loss. Nonetheless, as is common in deep learning, having more data played a big role and was another key element in reducing the loss.

# Scientific Validity

We split this section in two: we first provide a quantitative evaluation to ensure that what we are learning is actually good, and then we show some qualitative examples of images found by the model. **All the code we have written** to run our experiments (in combination with code made available by Nils Reimers and by the authors of the original CLIP) is available.

## Quantitative Evaluation

Those images are definitely cool and interesting, but a model is nothing without validation. Since this is the first CLIP-based model in Italian, we decided to use the multilingual CLIP model as a comparison baseline.

### mCLIP

The multilingual CLIP (henceforth, mCLIP) is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his [sentence-transformers](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual text encoder that was created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)). It shows great capabilities in representing multilingual text in the same space as images.
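
As a rough sketch of how this baseline can be queried through sentence-transformers (the model names are the public ones from the library's documentation, and the image path is a placeholder):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# mCLIP pairs a multilingual text encoder with the original CLIP image encoder
image_encoder = SentenceTransformer("clip-ViT-B-32")
text_encoder = SentenceTransformer("clip-ViT-B-32-multilingual-v1")

image_emb = image_encoder.encode([Image.open("photo.jpg")])  # placeholder image file
text_emb = text_encoder.encode(["un gatto nero sdraiato su un divano"])  # "a black cat lying on a sofa"

print(util.cos_sim(image_emb, text_emb))  # cosine similarity between the image and the caption
```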

### Tasks

We selected two different tasks:

+ image retrieval, in which, given a caption, the model has to find the most similar image;
+ zero-shot classification, in which, given an image and a set of captions (or labels), the model has to find the best matching caption for the image.

### Reproducibility

Both experiments should be very easy to replicate: we share the two colab notebooks we used.

### Image Retrieval

This experiment is run against the MSCOCO-IT validation set (which we did not use in training). Given an input caption from the dataset, we search for the most similar image in the MSCOCO-IT validation set and check whether it is the one that was described by the original caption. As the evaluation metric we use MRR@K.
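
To make the metric concrete, here is a small NumPy sketch of MRR@K (the array names are illustrative, and this is not our evaluation notebook): for each caption we rank all validation images by similarity and take the reciprocal rank of the correct image, counting it as zero when it falls outside the top K.

```python
import numpy as np

def mrr_at_k(similarities: np.ndarray, k: int) -> float:
    """similarities[i, j]: similarity of caption i and image j; image i is the correct match."""
    reciprocal_ranks = []
    for i, row in enumerate(similarities):
        order = np.argsort(-row)                    # image indices from most to least similar
        rank = int(np.where(order == i)[0][0]) + 1  # 1-based rank of the ground-truth image
        reciprocal_ranks.append(1.0 / rank if rank <= k else 0.0)
    return float(np.mean(reciprocal_ranks))

# toy usage with a random similarity matrix for 4 caption-image pairs
scores = np.random.rand(4, 4)
print(mrr_at_k(scores, k=5))
```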

| MRR    | CLIP-Italian | mCLIP  |
| ------ | ------------ | ------ |
| MRR@5  | **0.5039**   | 0.3957 |
| MRR@10 | **0.5204**   | 0.4129 |

It is true that we used MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained on 400 million images (and some of them were probably from MSCOCO).

### Zero-shot image classification

This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet. To do this, we used DeepL to translate the ImageNet labels. We evaluate the models by computing the accuracy at different levels.
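
The accuracy@k computation itself is straightforward; here is a minimal sketch under the assumption that images and (translated) labels have already been embedded and L2-normalized. This is not the exact evaluation script.

```python
import numpy as np

def accuracy_at_k(image_embeds, label_embeds, true_labels, k):
    """Fraction of images whose true (translated) label is among the k most similar labels."""
    logits = image_embeds @ label_embeds.T      # (n_images, n_labels) similarity scores
    top_k = np.argsort(-logits, axis=1)[:, :k]  # indices of the k best labels per image
    hits = [true_labels[i] in top_k[i] for i in range(len(true_labels))]
    return float(np.mean(hits))
```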

| Accuracy     | CLIP-Italian | mCLIP |
| ------------ | ------------ | ----- |
| Accuracy@1   | **22.11**    | 20.15 |
| Accuracy@10  | **52.55**    | 42.91 |
| Accuracy@100 | **81.08**    | 67.11 |

### Discussion

Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks we have been testing. Note, however, that our results are lower than those shown in the original OpenAI paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)), which was evaluated on English data. However, considering that our results are in line with those obtained by mCLIP, we think that the translated image labels might have had an impact on the final scores.

## Qualitative Evaluation

We hereby show some very interesting properties of the model. One is its ability to detect colors, then there is its (partial) counting ability, and finally its ability to understand more complex queries. You can find more examples in the "*Examples & Applications*" section of this demo.

To our own surprise, many of the answers the model gives make a lot of sense! Note that the model, in this case, is searching for the right image among a set of 25K images from an Unsplash dataset.

Look at the following - slightly cherry picked (but not even that much) - examples:

### Colors

# Limitations and Bias

Currently, the model is not without limits. To mention one, its counting capabilities seem very cool, but from our experiments the model finds it difficult to count beyond three; this is a general limitation that is common to many models of this kind.

There are even more evident issues: we found some biases and stereotypes that got into our model from different factors. Searching for "una troia" ("a bitch") on the CC dataset shows the picture of a woman. The model's capabilities even amplify this issue, as searching for "due troie" ("two bitches") again gives, as a result, a picture of two women. BERT models are not free from bias. Indeed, different BERT models - Italian ones included - are prone to create stereotyped sentences that are hurtful ([Nozza et al., 2021](https://www.aclweb.org/anthology/2021.naacl-main.191.pdf)).

Unfortunately, this kind of issue is common to many machine learning algorithms (see [Abid et al., 2021](https://arxiv.org/abs/2101.05783) for bias in GPT-3 as an example) and suggests we need to work even harder on this problem that affects our **society**.

# Useful Links

+ [GitHub Repository](https://github.com/clip-italian/clip-italian)
+ [Model on HuggingFace](https://huggingface.co/clip-italian/clip-italian)

# References

Abid, A., Farooqi, M., & Zou, J. (2021). [Persistent anti-Muslim bias in large language models](https://arxiv.org/abs/2101.05783). arXiv preprint arXiv:2101.05783.

Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. (2021). [WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning](https://arxiv.org/pdf/2103.01913.pdf). arXiv preprint arXiv:2103.01913.

# Other Notes

This readme has been designed using resources from Flaticon.com.