# Italian CLIP

CLIP ([Radford et al., 2021](https://arxiv.org/abs/2103.00020)) is an amazing model that can learn to represent images and text jointly in the same space.

In this project, we aim to propose the first CLIP model trained on Italian data; in this context, Italian can be considered a low-resource language. Using a few smart techniques, we have been able to fine-tune a SOTA Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model is built upon the pre-trained [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI [vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).

In building this project we kept in mind the following principles:

+ **Novel Contributions**: We created a dataset of ~1.4 million Italian image-text pairs (**that we will share with the community**) and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence;
+ **Scientific Validity**: Claims are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models on two tasks and made the validation reproducible for everybody;
+ **Broader Outlook**: We always kept in mind the possible usages and limitations of this model.

We put our **hearts** and **souls** into the project during this week! Not only did we work on a cool project, but we were also able to make new friends and learn a lot from each other while working towards a common goal!

+ *Image to Text*: This task is essentially a zero-shot image classification task. The user provides an image and a set of captions/labels, and CLIP computes the similarity between the image and each label. The webapp then displays a probability distribution over the captions (a minimal sketch of this step is shown right after this list).

+ *Examples & Applications*: This page showcases some interesting results we got from the model; we believe that there are different applications that can start from here.
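
As mentioned in the *Image to Text* bullet above, here is a toy sketch of that final step: turning image-caption similarities into the probability distribution shown by the webapp. The embeddings, the temperature value, and the function name are made up for illustration; this is not the demo code.

```python
import numpy as np

def caption_probabilities(image_embed, caption_embeds, temperature=20.0):
    """Softmax over caption similarities for a single image (illustrative temperature)."""
    # embeddings are assumed to be L2-normalized, so the dot product is a cosine similarity
    logits = temperature * caption_embeds @ image_embed
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# toy usage with random 512-d vectors standing in for CLIP embeddings
rng = np.random.default_rng(0)
image_embed = rng.normal(size=512)
image_embed /= np.linalg.norm(image_embed)
caption_embeds = rng.normal(size=(3, 512))
caption_embeds /= np.linalg.norm(caption_embeds, axis=1, keepdims=True)
print(caption_probabilities(image_embed, caption_embeds))  # probabilities over the 3 captions
```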

# Novel Contributions

We considered four main sources of data:

+ [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions described in the paper, as they are the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994), and this kind of text, without more information, is not useful for learning a good mapping between images and captions. On the other hand, this text is written in Italian and it is of good quality. We cannot simply remove short captions, as some of them are still good (e.g., "running dog"). Thus, to prevent polluting the data with captions that are not meaningful, we used *POS tagging* (a sketch of this kind of filter is shown right after this list). Captions like *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.

+ [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than 100K images, and for each image more than one caption is available.
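
To make the POS-tagging filter mentioned in the WIT bullet above concrete, here is a minimal sketch of that kind of rule, assuming spaCy's `it_core_news_sm` Italian model. It is not the exact set of rules we applied, just the general idea of keeping captions that contain at least one content word.

```python
import spacy

# assumes the Italian model is installed: python -m spacy download it_core_news_sm
nlp = spacy.load("it_core_news_sm")

CONTENT_POS = {"NOUN", "VERB", "ADJ"}  # common nouns, verbs, adjectives

def is_meaningful(caption: str) -> bool:
    """Keep captions with at least one content word; drop name-only captions."""
    doc = nlp(caption)
    return any(token.pos_ in CONTENT_POS for token in doc)

captions = [
    "Anna Maria Mozzoni",               # name only
    "Dora Riparia",                     # name only
    "un cane che corre sulla spiaggia", # "a dog running on the beach"
]
kept = [c for c in captions if is_meaningful(c)]
print(kept)  # the descriptive caption is kept, name-only captions are (typically) dropped
```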

The fixed logit scale is used after the computation of the similarity between the images and the texts in CLIP (see the code [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/modeling_hybrid_clip.py#L64)). We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence embeddings.
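
For reference, here is a small NumPy sketch (not our Flax training code) of where a fixed logit scale enters a CLIP-style contrastive loss: the image-text similarity matrix is multiplied by a constant scale, rather than a learned temperature, before the usual symmetric cross-entropy. The value 20.0 below is only an illustrative constant.

```python
import numpy as np

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale=20.0):
    """Symmetric contrastive loss with a fixed (non-learned) logit scale; illustrative only."""
    # L2-normalize so that the dot products are cosine similarities
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)

    # the fixed scale is applied right after the similarity computation
    logits = logit_scale * image_embeds @ text_embeds.T
    targets = np.arange(logits.shape[0])  # the i-th image matches the i-th text

    def cross_entropy(l):
        l = l - l.max(axis=-1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return -log_probs[np.arange(l.shape[0]), targets].mean()

    # average of the image-to-text and text-to-image losses
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```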

### Effect of Our Edits

The following picture showcases the effect that these edits have had on our evaluation loss:

<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="95%"/>

The purple line is the original training without any of our improvements: you can see how many steps we needed to get the loss down. The yellow line is the loss with the new optimizer: it is **striking** to see how much time we save thanks to this addition! Not only does the loss improve, it also converges much faster! The blue line shows the results when fixed scaling is added on top of the new optimizer. Finally, we added the backbone freezing, and you can see the results in the light blue loss. Nonetheless, as is common in deep learning, having more data played a big role and was another key element in reducing the loss.

# Scientific Validity

We split this section in two: we first provide a quantitative evaluation to ensure that what we are learning is actually good, and then we show some qualitative examples of images found by the model. **All the code we have written** to run our experiments (in combination with code made available by Nils Reimers and by the authors of the original CLIP) is available.

## Quantitative Evaluation

Those images are definitely cool and interesting, but a model is nothing without validation. Since this is the first CLIP-based model in Italian, we decided to use the multilingual CLIP model as a comparison baseline.

### mCLIP

The multilingual CLIP (henceforth, mCLIP) is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his [sentence-transformers](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual text encoder that was created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)). It shows great capabilities in representing multilingual text in the same space as images.
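
As a rough sketch of how this baseline can be queried through sentence-transformers (the model names are the public ones from the library's documentation, and the image path is a placeholder):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# mCLIP pairs a multilingual text encoder with the original CLIP image encoder
image_encoder = SentenceTransformer("clip-ViT-B-32")
text_encoder = SentenceTransformer("clip-ViT-B-32-multilingual-v1")

image_emb = image_encoder.encode([Image.open("photo.jpg")])  # placeholder image file
text_emb = text_encoder.encode(["un gatto nero sdraiato su un divano"])  # "a black cat lying on a sofa"

print(util.cos_sim(image_emb, text_emb))  # cosine similarity between the image and the caption
```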

### Tasks

We selected two different tasks:

+ image retrieval, in which, given a caption, the model has to find the most similar image;
+ zero-shot classification, in which, given an image and a set of captions (or labels), the model has to find the best matching caption for the image.

### Reproducibility

Both experiments should be very easy to replicate: we share the two colab notebooks we used.

### Image Retrieval

This experiment is run against the MSCOCO-IT validation set (which we did not use in training). Given an input caption from the dataset, we search for the most similar image in the MSCOCO-IT validation set and check whether it is the one that was described by the original caption. As the evaluation metric we use MRR@K.
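
To make the metric concrete, here is a small NumPy sketch of MRR@K (the array names are illustrative, and this is not our evaluation notebook): for each caption we rank all validation images by similarity and take the reciprocal rank of the correct image, counting it as zero when it falls outside the top K.

```python
import numpy as np

def mrr_at_k(similarities: np.ndarray, k: int) -> float:
    """similarities[i, j]: similarity of caption i and image j; image i is the correct match."""
    reciprocal_ranks = []
    for i, row in enumerate(similarities):
        order = np.argsort(-row)                    # image indices from most to least similar
        rank = int(np.where(order == i)[0][0]) + 1  # 1-based rank of the ground-truth image
        reciprocal_ranks.append(1.0 / rank if rank <= k else 0.0)
    return float(np.mean(reciprocal_ranks))

# toy usage with a random similarity matrix for 4 caption-image pairs
scores = np.random.rand(4, 4)
print(mrr_at_k(scores, k=5))
```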

| MRR    | CLIP-Italian | mCLIP  |
| ------ | ------------ | ------ |
| MRR@5  | **0.5039**   | 0.3957 |
| MRR@10 | **0.5204**   | 0.4129 |

It is true that we used MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained on 400 million images (and some of them were probably from MSCOCO).

### Zero-shot image classification

This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet. To do this, we used DeepL to translate the ImageNet labels. We evaluate the models by computing the accuracy at different levels.
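
The accuracy@k computation itself is straightforward; here is a minimal sketch under the assumption that images and (translated) labels have already been embedded and L2-normalized. This is not the exact evaluation script.

```python
import numpy as np

def accuracy_at_k(image_embeds, label_embeds, true_labels, k):
    """Fraction of images whose true (translated) label is among the k most similar labels."""
    logits = image_embeds @ label_embeds.T      # (n_images, n_labels) similarity scores
    top_k = np.argsort(-logits, axis=1)[:, :k]  # indices of the k best labels per image
    hits = [true_labels[i] in top_k[i] for i in range(len(true_labels))]
    return float(np.mean(hits))
```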

| Accuracy     | CLIP-Italian | mCLIP |
| ------------ | ------------ | ----- |
| Accuracy@1   | **22.11**    | 20.15 |
| Accuracy@10  | **52.55**    | 42.91 |
| Accuracy@100 | **81.08**    | 67.11 |

### Discussion

Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks we have been testing. Note, however, that our results are lower than those shown in the original OpenAI paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)), which was evaluated on English data. However, considering that our results are in line with those obtained by mCLIP, we think that the translated image labels might have had an impact on the final scores.

## Qualitative Evaluation

We hereby show some very interesting properties of the model. One is its ability to detect colors, then there is its (partial) counting ability, and finally its ability to understand more complex queries. You can find more examples in the "*Examples & Applications*" section of this demo.

To our own surprise, many of the answers the model gives make a lot of sense! Note that the model, in this case, is searching for the right image among a set of 25K images from an Unsplash dataset.

Look at the following - slightly cherry picked (but not even that much) - examples:

### Colors

# Limitations and Bias

Currently, the model is not without limits. To mention one, its counting capabilities seem very cool, but from our experiments the model finds it difficult to count beyond three; this is a general limitation that is common to many models of this kind.

There are even more evident issues: we found some biases and stereotypes that got into our model from different factors. Searching for "una troia" ("a bitch") on the CC dataset shows the picture of a woman. The model's capabilities even amplify this issue, as searching for "due troie" ("two bitches") again gives, as a result, a picture of two women. BERT models are not free from bias. Indeed, different BERT models - Italian ones included - are prone to create stereotyped sentences that are hurtful ([Nozza et al., 2021](https://www.aclweb.org/anthology/2021.naacl-main.191.pdf)).

Unfortunately, this kind of issue is common to many machine learning algorithms (see [Abid et al., 2021](https://arxiv.org/abs/2101.05783) for bias in GPT-3 as an example) and suggests we need to work even harder on this problem that affects our **society**.

# Useful Links

+ [GitHub Repository](https://github.com/clip-italian/clip-italian)
+ [Model on HuggingFace](https://huggingface.co/clip-italian/clip-italian)

# References

Abid, A., Farooqi, M., & Zou, J. (2021). [Persistent anti-Muslim bias in large language models](https://arxiv.org/abs/2101.05783). arXiv preprint arXiv:2101.05783.

Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. (2021). [WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning](https://arxiv.org/pdf/2103.01913.pdf). arXiv preprint arXiv:2103.01913.

# Other Notes

This readme has been designed using resources from Flaticon.com.