File size: 2,834 Bytes
6417422
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1f9045f
6417422
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
---
tags:
  - feature-extraction
  - sentence-similarity
  - mteb
language: en
inference: false
license: apache-2.0
---
<!-- TODO: add evaluation results here -->
<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

## Quick Start

The easiest way to starting using `jina-clip-v1` is to use Jina AI's [Embedding API](https://jina.ai/embeddings/).

## Intended Usage & Model Info

`jina-clip-v1` is an English, monolingual **multimodal (text-image) embedding model**.

Traditional text embedding models, such as [jina-embeddings-v2-base-en](https://huggingface.co./jinaai/jina-embeddings-v2-base-en),
excel in text-to-text retrieval but lack cross-modal retrieval capabilities.
Conversely, CLIP-like models, such as [openai/clip-vit-base-patch32](https://huggingface.co./openai/clip-vit-base-patch32),
align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.

`jina-clip-v1` is an innovative **multimodal embedding model**.
Its text component achieves comparable performance to `jina-embeddings-v2-base-en` in text-to-text retrieval,
while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks.
This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications,
allowing for both text-to-text and text-to-image searches with a single model.


## Data & Parameters

Jina CLIP V1 [technical report]() coming soon.

## Usage

You can use Jina CLIP directly from transformers package.

```python
!pip install transformers
from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-clip-v1')
text_embeddings = model.encode_text(['How is the weather today?', 'What is the current weather like today?'])
image_embeddings = model.encode_image(['raindrop.png'])
print(cos_sim(text_embeddings[0], text_embeddings[1])) # text embedding similarity
print(cos_sim(text_embeddings[0], image_embeddings[0])) # text-image cross-modal similarity
```

## Performance

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find Jina CLIP useful in your research, please cite the following paper:

```console
TBD
```