SigLIP just got merged to 🤗transformers and it's super easy to use! To celebrate this, I have created a repository of various SigLIP-based projects! | |
But what is it and how does it work? SigLIP is a vision-text pre-training technique based on contrastive learning. | |
It jointly trains an image encoder and a text encoder so that the dot product of their embeddings is highest for matching text-image pairs. | |
The image below is taken from CLIP, where this contrastive pre-training takes place with softmax, but SigLIP replaces softmax with sigmoid. 📎 | |
data:image/s3,"s3://crabby-images/0a9a8/0a9a8cb26eef52220292abcb08a9570b0c374ccc" alt="image_1" | |
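To make the softmax-to-sigmoid swap concrete, here is a minimal pure-Python sketch of a pairwise sigmoid loss. It is an illustration under stated assumptions, not the authors' implementation: the function names are mine, and the `temperature` and `bias` defaults follow the initial values described in the paper (in real training both are learnable). Every image-text pair in the batch is treated as its own binary problem, with label +1 for the matching pair and -1 for all others.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss; image i is assumed to match text i.

    temperature and bias are illustrative defaults (assumption), not tuned values.
    """
    images = [l2_normalize(e) for e in image_embs]
    texts = [l2_normalize(e) for e in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # +1 for the matching pair, -1 otherwise
            total += log_sigmoid(label * logit)
    # Average over the batch (number of images), summing over all pairs.
    return -total / len(images)


# Toy 2-d embeddings (hypothetical data): aligned pairs vs. swapped pairs.
aligned = sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because each pair contributes independently, no normalization across the whole batch is needed, which is what distinguishes this from the softmax formulation in the CLIP figure above.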
Highlights✨ | |
🖼️📝 The authors used a medium-sized B/16 ViT as the image encoder and a B-sized transformer as the text encoder | |
😍 More performant than CLIP on zero-shot | |
🗣️ The authors trained a multilingual model too! | |
⚡️ Super efficient: the sigmoid loss allows scaling up to 1M items per batch, but the authors chose 32k (see the performance saturation below) | |
data:image/s3,"s3://crabby-images/f2960/f2960c2dcb492c98289c9271493128541f7a4b1c" alt="image_2" | |
Below you can find prior CLIP models and SigLIP across different image encoder sizes and their performance on different datasets 👇🏻 | |
data:image/s3,"s3://crabby-images/67e09/67e09f90d3f94c1f7d0a046c261f1081e5c13f54" alt="image_3" | |
The 🤗 Transformers integration comes with a zero-shot-image-classification pipeline, which makes SigLIP super easy to use! | |
data:image/s3,"s3://crabby-images/17ec4/17ec4d3a5ae3717e47385d982f96ecc1a926f123" alt="image_4" | |
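In text form, using the pipeline could look roughly like the sketch below. The checkpoint name `google/siglip-base-patch16-224` is one of the SigLIP checkpoints on the Hub; the helper names `classify` and `best_label` and the image path are placeholders of mine, not part of the library.

```python
def classify(image, candidate_labels, model_id="google/siglip-base-patch16-224"):
    """Run zero-shot image classification with a SigLIP checkpoint.

    `image` can be a local path, a URL, or a PIL image.
    """
    # Imported lazily so best_label() below works without transformers installed.
    from transformers import pipeline

    classifier = pipeline(task="zero-shot-image-classification", model=model_id)
    return classifier(image, candidate_labels=candidate_labels)


def best_label(predictions):
    """Pick the top entry from the pipeline's list of {"label", "score"} dicts."""
    return max(predictions, key=lambda p: p["score"])["label"]


# Example (downloads the checkpoint on first use; "cat.jpg" is a placeholder path):
# predictions = classify("cat.jpg", ["a cat", "a dog", "a plane"])
# print(best_label(predictions))
```

Since SigLIP scores each label independently with a sigmoid, the returned scores should not be expected to sum to 1 the way CLIP's softmax scores do.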
What to use SigLIP for? 🧐 | |
Honestly, the possibilities are endless, but you can use it for image/text retrieval, zero-shot classification, and training multimodal models! | |
I have made a repository with notebooks and applications that are also hosted on [Spaces](https://t.co/Ah1CrHVuPY). | |
I have built ["Draw to Search Art"](https://t.co/DcmQWMc1qd) where you can input an image (upload one or draw it) and search among 10k images from WikiArt! | |
I've also built apps to [compare](https://t.co/m699TMvuW9) CLIP and SigLIP outputs. | |
data:image/s3,"s3://crabby-images/96cea/96cea3f889fdd199e689ebdccb1f6494b8ea2575" alt="image_5" | |
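Image search like the one described above usually comes down to nearest-neighbour lookup over embeddings. The sketch below shows one plausible way to do it with cosine similarity; it is not necessarily how the app is built, and the toy vectors stand in for real SigLIP image embeddings that you would compute ahead of time.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_emb, gallery_embs, top_k=3):
    """Return (index, similarity) pairs for the top_k gallery items closest to the query."""
    scored = [(i, cosine_similarity(query_emb, emb)) for i, emb in enumerate(gallery_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-d vectors standing in for precomputed image embeddings (hypothetical data).
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
results = search(query, gallery, top_k=2)
```

For a gallery of around 10k images a linear scan like this is fine; larger collections would typically use a vector index instead.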
> [!TIP] | |
Resources: | |
Sigmoid Loss for Language Image Pre-Training | |
by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer (2023) | |
[GitHub](https://github.com/google-research/big_vision) | |
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/siglip) | |
> [!NOTE] | |
[Original tweet](https://twitter.com/mervenoyann/status/1745476609686089800) (January 11, 2024) |