---
tags:
  - feature-extraction
  - sentence-similarity
  - mteb
language: en
inference: false
license: apache-2.0
---
<!-- TODO: add evaluation results here -->
<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>

# jina-clip-v1
Jina CLIP: your CLIP model is also your text retriever!


## Intended Usage & Model Info

`jina-clip-v1` is a state-of-the-art English **multimodal (text-image) embedding model**.

Traditional text embedding models, such as [jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en), excel at text-to-text retrieval but fall short on cross-modal tasks. In contrast, models like [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) align image and text embeddings effectively but are not optimized for text-to-text retrieval because of their training methodology and limited context length.

`jina-clip-v1` bridges this gap by offering robust performance in both domains. Its text component matches the retrieval efficiency of `jina-embeddings-v2-base-en`, while its overall architecture sets a new benchmark for cross-modal retrieval. This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (M-RAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.


## Data & Parameters

See [our paper](https://arxiv.org/abs/2405.20204) for details on the training data and model parameters.

## Usage

You can use Jina CLIP directly via the `transformers` package.

```python
# Install dependencies first (in a notebook, prefix the command with "!"):
#   pip install transformers einops timm pillow
from transformers import AutoModel
from numpy.linalg import norm


def cos_sim(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return (a @ b.T) / (norm(a) * norm(b))


# trust_remote_code=True is needed because the model ships its own modeling code
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

sentences = ['How is the weather today?', 'What is the current weather like today?']
images = ['raindrop.jpg', 'sunny.jpg']  # image file paths

text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(images)

print(cos_sim(text_embeddings[0], text_embeddings[1]))  # text-text similarity
print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-image cross-modal similarity
```
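
The `cos_sim` helper compares one pair of vectors at a time. For retrieval over many candidates it is often more convenient to compute all pairwise similarities at once and sort by score. The following is a minimal sketch, assuming the embeddings are available as NumPy arrays of shape `(n, dim)`. `cosine_similarity_matrix` and `rank_candidates` are hypothetical helper names, not part of the `jina-clip-v1` API, and the random arrays are stand-ins for real model output.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the (len(a), len(b)) matrix of cosine similarities between rows of a and b."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    # Normalize every row to unit length, then a dot product gives the cosine
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate indices sorted from most to least similar, with their scores."""
    scores = cosine_similarity_matrix(query_embedding, candidate_embeddings)[0]
    order = np.argsort(-scores)
    return order, scores[order]


# Random stand-ins for model output (hypothetical data: 4 candidates, dimension 8)
rng = np.random.default_rng(0)
candidates = rng.normal(size=(4, 8))
query = candidates[2] + 0.01 * rng.normal(size=8)  # nearly identical to candidate 2

order, scores = rank_candidates(query, candidates)
print(order[0])  # index of the best-matching candidate
```

In practice, `query` would come from `model.encode_text` and `candidates` from `model.encode_text` or `model.encode_image`, depending on whether the search is text-to-text or text-to-image.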


## Performance

### Text-Image Retrieval

| Name             | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
|------------------|-------------------------|-------------------------|-----------------------|-----------------------|
| ViT-B-32         | 0.597                   | 0.8398                  | 0.781                 | 0.938                 |
| ViT-B-16         | 0.6216                  | 0.8572                  | 0.822                 | 0.966                 |
| jina-clip        | 0.6748                  | 0.8902                  | 0.811                 | 0.965                 |


| Name             | MSCOCO Image Retr. R@1  | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
|------------------|-------------------------|-------------------------|-----------------------|-----------------------|
| ViT-B-32         | 0.342                   | 0.6001                  | 0.5234                | 0.7634                |
| ViT-B-16         | 0.3309                  | 0.5842                  | 0.5242                | 0.767                 |
| jina-clip        | 0.4111                  | 0.6644                  | 0.5544                | 0.7904                |
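
The R@1 and R@5 columns report recall at rank k, commonly defined as the fraction of queries for which a correct item appears among the top k retrieved results. The sketch below shows one way to compute it from a similarity matrix. It is not the evaluation code behind these tables; `recall_at_k` is a hypothetical helper and the scores are made up for illustration.

```python
import numpy as np


def recall_at_k(similarity, relevant, k):
    """Fraction of queries with at least one relevant item among their top-k results.

    similarity: (n_queries, n_items) score matrix, higher means more similar.
    relevant:   list of sets; relevant[i] holds the correct item indices for query i.
    """
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = [bool(relevant[i] & set(top_k[i].tolist())) for i in range(len(relevant))]
    return float(np.mean(hits))


# Toy example (made-up scores): 3 queries, 4 items
similarity = np.array([
    [0.9, 0.1, 0.3, 0.2],  # query 0: correct item 0 is ranked first
    [0.2, 0.4, 0.8, 0.5],  # query 1: correct item 3 is ranked second
    [0.7, 0.6, 0.1, 0.3],  # query 2: correct item 2 is ranked last
])
relevant = [{0}, {3}, {2}]

print(recall_at_k(similarity, relevant, k=1))
print(recall_at_k(similarity, relevant, k=2))
```

Using a set of relevant indices per query covers datasets where one query has several correct matches, such as an image paired with multiple captions.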

### Text-Text Retrieval

| Name                  | STS12  | STS15  | STS17  | STS13  | STS14  | STS16  | STS22  | STSBenchmark | SummEval |
|-----------------------|--------|--------|--------|--------|--------|--------|--------|--------------|----------|
| jina-embeddings-v2    | 0.7427 | 0.8755 | 0.8888 | 0.833  | 0.7917 | 0.836  | 0.6346 | 0.8404       | 0.3056   |
| jina-clip             | 0.7352 | 0.8746 | 0.8976 | 0.8323 | 0.7868 | 0.8377 | 0.6583 | 0.8493       | 0.3048   |


| Name               | ArguAna | FiQA2018 | NFCorpus | Quora  | SCIDOCS | SciFact | TRECCOVID |
|--------------------|---------|----------|----------|--------|---------|---------|-----------|
| jina-embeddings-v2 | 0.4418  | 0.4158   | 0.3245   | 0.882  | 0.1986  | 0.6668  | 0.6591    |
| jina-clip          | 0.4933  | 0.3827   | 0.3352   | 0.8789 | 0.2024  | 0.6734  | 0.7161    |

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find `jina-clip-v1` useful in your research, please cite the following paper:

```bibtex
@misc{2405.20204,
Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
Year = {2024},
Eprint = {arXiv:2405.20204},
}
```


**Note: our empirical study shows that text-text cosine similarity is usually larger than text-image cosine similarity.**
If you want to merge the two scores, we recommend one of two approaches:

1. Take a weighted average of the text-text and text-image similarities:

```python
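def combine_scores(query_document_sim, text_image_sim, alpha=0.6, beta=0.4):
    # Weighted average of the text-text and text-image similarity scores.
    # alpha and beta are example weights; adjust them for your use case.
    return alpha * query_document_sim + beta * text_image_sim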
```

2. Apply z-score normalization before merging the scores:

```python
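import numpy as np


def normalize_scores(cos_sim_query_documents, cos_sim_text_images):
    # Z-score each set of cosine similarities separately so both share a common scale
    cos_sim_query_documents = np.asarray(cos_sim_query_documents, dtype=float)
    cos_sim_text_images = np.asarray(cos_sim_text_images, dtype=float)

    query_document_mean = np.mean(cos_sim_query_documents)
    query_document_std = np.std(cos_sim_query_documents)
    text_image_mean = np.mean(cos_sim_text_images)
    text_image_std = np.std(cos_sim_text_images)

    query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
    text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
    return query_document_sim_normalized, text_image_sim_normalized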
```
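
Once both score sets are normalized, they can be merged with the same kind of weighted sum as in the first approach and then used for ranking. The following is a minimal sketch under assumptions: `z_score` and `merge_scores` are hypothetical helpers, the weights are the example values 0.6 and 0.4, and the input arrays are made-up stand-ins for real cosine similarities.

```python
import numpy as np


def z_score(scores):
    """Rescale scores to zero mean and unit standard deviation."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # All scores are identical, so there is nothing to rank on
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std


def merge_scores(text_text_sim, text_image_sim, alpha=0.6, beta=0.4):
    """Weighted sum of z-score-normalized text-text and text-image similarities."""
    return alpha * z_score(text_text_sim) + beta * z_score(text_image_sim)


# Made-up similarities for one query against four candidates
cos_sim_query_documents = np.array([0.82, 0.75, 0.64, 0.79])
cos_sim_text_images = np.array([0.31, 0.27, 0.35, 0.22])

combined = merge_scores(cos_sim_query_documents, cos_sim_text_images)
ranking = np.argsort(-combined)  # candidate indices, best first
print(ranking)
```

Normalizing first keeps the typically larger text-text scores from dominating the merged ranking purely because of their scale.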