# Italian CLIP

With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples.

In building this project we kept in mind the following principles:
+ **Novel Contributions**: We created a dataset of ~1.4 million Italian image-text pairs and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence;
+ **Scientific Validity**: Claims are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models on several tasks and made the validation reproducible for everybody.
+ **Broader Outlook**: We always kept in mind the possible uses of this model.
We put our **hearts** and **souls** into the project during this week! Not only did we work on a cool project, but we were
able to make new friends and learn a lot from each other while working towards a common goal!

Thank you for this amazing opportunity, we hope you will like the results. :heart:
# Novel Contributions

The original CLIP model was trained on 400 million image-text pairs; this amount of data is not available for Italian.
We indeed worked in a **low-resource setting**. The only Italian captioning datasets in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
To get competitive results we followed three strategies:
1. more data;
2. better augmentations;
3. better training.
## More Data

We had to deal with the fact that we do not have access to the same data that OpenAI had when training CLIP.
Thus, we tried to add as much data as possible while keeping the data quality as high as possible.
We considered three main sources of data:
+ WIT. Most of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
  However, this kind of text alone is not useful for learning a good mapping between images and captions.
  On the other hand, the text is written in Italian and is of good quality.
  To avoid polluting the data with captions that are not meaningful, we ran POS tagging
  on the data and removed all captions composed of 80% or more proper nouns (PROPN); a minimal sketch of this filter follows the list below.
  Example: ....
+ MSCOCO-IT.
+ Conceptual Captions.
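
To make the WIT filtering step concrete, here is a minimal sketch of the proper-noun filter, assuming spaCy's small Italian pipeline (`it_core_news_sm`) as the POS tagger; the exact tagger and token handling in our pipeline may differ.

```python
import spacy

# Assumption: spaCy's small Italian pipeline is used for POS tagging.
# Install it with: python -m spacy download it_core_news_sm
nlp = spacy.load("it_core_news_sm")

def is_mostly_proper_nouns(caption: str, threshold: float = 0.8) -> bool:
    """Return True if at least `threshold` of the caption's
    non-punctuation tokens are tagged as proper nouns (PROPN)."""
    tokens = [t for t in nlp(caption) if not t.is_punct and not t.is_space]
    if not tokens:
        return True  # treat empty captions as non-meaningful too
    propn_count = sum(t.pos_ == "PROPN" for t in tokens)
    return propn_count / len(tokens) >= threshold

# Keep only captions that carry descriptive content: a caption that is
# almost entirely proper nouns (e.g., "Roberto Baggio") gets dropped.
captions = ["Roberto Baggio", "Un gatto dorme su un divano rosso"]
kept = [c for c in captions if not is_mostly_proper_nouns(c)]
```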
## Better Augmentations
## Better Training

After several trials, we realized that the usual way of training this model was
not good enough to get competitive results. We thus modified two parts of the
training pipeline: the optimizer and the handling of frozen components.
### Optimizer

The standard AdamW didn't seem enough to train the model...
### Backbone Freezing

<img src="static/img/clip-italian.png" alt="drawing" width="200"/>
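
As an illustration of what freezing looks like in a JAX/Flax training setup, the sketch below routes the pretrained backbones to a zero update with `optax.multi_transform` while the remaining parameters are trained. The module names (`vision_model`, `text_model`) are placeholders, and this is a sketch of the technique rather than our exact training code.

```python
import optax
from flax import traverse_util

def make_optimizer(params, learning_rate=1e-4):
    """Build an optimizer that freezes the pretrained backbones.

    `params` is assumed to be a plain nested dict of parameters
    (unfreeze a Flax FrozenDict first if needed).
    """
    # Label every parameter leaf by its top-level module name:
    # the two backbones are "frozen", everything else is "trainable".
    flat = traverse_util.flatten_dict(params)
    labels = {
        path: "frozen" if path[0] in ("vision_model", "text_model")
        else "trainable"
        for path in flat
    }
    labels = traverse_util.unflatten_dict(labels)

    # Frozen leaves receive zero updates; the rest are trained with AdamW.
    return optax.multi_transform(
        {"trainable": optax.adamw(learning_rate),
         "frozen": optax.set_to_zero()},
        labels,
    )
```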
# Scientific Validity

## Quantitative Evaluation

Cool examples are definitely interesting, but a model is nothing without validation.
To better understand how well our CLIP-Italian model works, we ran an experimental evaluation. Since this is the first CLIP-based model for Italian, we used the multilingual CLIP model (mCLIP) as a comparison baseline.
### mCLIP

### Experiments Replication

We provide two Colab notebooks to replicate both experiments (linked in the task sections below).
### Tasks

We selected two different tasks:
+ image retrieval
+ zero-shot classification
### Image Retrieval

| MRR    | CLIP-Italian | mCLIP |
| ------ | ------------ | ----- |
| MRR@1  | **0.3797**   |       |
| MRR@5  | **0.5039**   |       |
| MRR@10 | **0.5204**   |       |

[Colab: Image Retrieval Evaluation](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
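
As a reminder, MRR@K averages the reciprocal rank of the correct image over all query captions, scoring a query as 0 when the correct image does not appear in the top K. A minimal sketch of the metric (not the notebook's exact code):

```python
import numpy as np

def mrr_at_k(ranks: np.ndarray, k: int) -> float:
    """Mean Reciprocal Rank at cutoff k.

    `ranks` holds, for each query caption, the 1-based rank of its
    ground-truth image in the retrieved list.
    """
    reciprocal = np.where(ranks <= k, 1.0 / ranks, 0.0)
    return float(reciprocal.mean())

# Toy example: four queries whose correct image ranked 1st, 3rd, 2nd, 12th.
ranks = np.array([1, 3, 2, 12])
print(mrr_at_k(ranks, 1))   # 0.25
print(mrr_at_k(ranks, 5))   # (1 + 1/3 + 1/2 + 0) / 4 ≈ 0.458
print(mrr_at_k(ranks, 10))  # unchanged: rank 12 is still outside the top 10
```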
### Zero-shot classification

| Accuracy (%) | CLIP-Italian | mCLIP |
| ------------ | ------------ | ----- |
| Accuracy@1   | **22.11**    | 20.15 |
| Accuracy@5   | **43.69**    | 36.57 |
| Accuracy@10  | **52.55**    | 42.91 |
| Accuracy@100 | **81.08**    | 67.11 |

[Colab: ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
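
Mechanically, zero-shot classification with a CLIP-style model amounts to embedding each class label inside an Italian prompt and predicting the class whose prompt embedding is most similar to the image embedding. Below is a self-contained mock of this procedure; the random vectors stand in for the real encoders, and the prompt template and class names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mock encoders standing in for the CLIP-Italian text and image towers.
# Assumption: the real encoders return L2-normalized embeddings.
def encode_text(prompts, dim=512):
    v = rng.normal(size=(len(prompts), dim))
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def encode_image(dim=512):
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Italian prompts for a few classes (illustrative template).
classes = ["gatto", "cane", "aeroplano"]
text_emb = encode_text([f"una foto di un {c}" for c in classes])
img_emb = encode_image()

# Zero-shot prediction: the class whose prompt is most similar to the image.
scores = text_emb @ img_emb
predicted = classes[int(np.argmax(scores))]
```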
## Qualitative Evaluation

# Broader Outlook

# Other Notes
This README has been designed using resources from Flaticon.com