---
tags:
  - feature-extraction
  - sentence-similarity
  - mteb
language: en
inference: false
license: apache-2.0
---




The multimodal embedding model trained by Jina AI.

## Quick Start

The easiest way to start using `jina-clip-v1` is through Jina AI's Embedding API.
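A request to the hosted API might look like the sketch below. This is a minimal sketch, assuming an OpenAI-style `/v1/embeddings` endpoint that accepts a mix of text and image inputs; the payload shape, the placeholder API key, and the image URL are assumptions, so consult the Embedding API documentation for the authoritative schema.

```python
# Minimal sketch of calling the hosted Embedding API.
# The endpoint and payload shape are assumptions; check the official docs.
import requests

response = requests.post(
    'https://api.jina.ai/v1/embeddings',
    headers={
        'Content-Type': 'application/json',
        'Authorization': 'Bearer <YOUR_JINA_API_KEY>',  # placeholder key
    },
    json={
        'model': 'jina-clip-v1',
        # A multimodal model may accept a mix of text and image entries
        'input': [
            {'text': 'How is the weather today?'},
            {'image': 'https://example.com/raindrop.png'},  # hypothetical image URL
        ],
    },
)
embeddings = [item['embedding'] for item in response.json()['data']]
```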

## Intended Usage & Model Info

`jina-clip-v1` is an English multimodal (text-image) embedding model.

Traditional text embedding models, such as `jina-embeddings-v2-base-en`, excel in text-to-text retrieval but lack cross-modal retrieval capabilities. Conversely, CLIP-like models, such as `openai/clip-vit-base-patch32`, align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.

`jina-clip-v1` is an innovative multimodal embedding model. Its text component achieves performance comparable to `jina-embeddings-v2-base-en` in text-to-text retrieval, while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks. This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications, allowing both text-to-text and text-to-image search with a single model.

## Data & Parameters

Jina CLIP V1 technical report coming soon.

## Usage

You can use Jina CLIP directly through the `transformers` package. Note that the model relies on custom code, so you must pass `trust_remote_code=True` when loading it.

```python
!pip install transformers

from transformers import AutoModel
from numpy.linalg import norm

# Cosine similarity between two embedding vectors
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

# trust_remote_code=True is required because the model ships custom code
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

text_embeddings = model.encode_text(
    ['How is the weather today?', 'What is the current weather like today?']
)
image_embeddings = model.encode_image(['raindrop.png'])  # path to a local image file

print(cos_sim(text_embeddings[0], text_embeddings[1]))   # text-to-text similarity
print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-to-image cross-modal similarity
```
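Because the same model embeds both modalities into one vector space, you can search a mixed corpus of texts and images with a single text query, as described in the Intended Usage section above. The following is a minimal sketch: the documents and image files are hypothetical placeholders, and `model` is the instance loaded above.

```python
import numpy as np

# Hypothetical mixed corpus: short texts plus local image files
docs = ['A rainy day in the city.', 'Sunshine over the mountains.']
images = ['raindrop.png', 'sunset.png']

doc_embeddings = np.asarray(model.encode_text(docs))
image_embeddings = np.asarray(model.encode_image(images))
corpus = np.vstack([doc_embeddings, image_embeddings])
labels = docs + images

query = np.asarray(model.encode_text(['wet weather']))[0]

# Rank every corpus item (text or image) by cosine similarity to the query
scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
for idx in np.argsort(-scores):
    print(f'{scores[idx]:.3f}  {labels[idx]}')
```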

## Performance

Detailed benchmark results will be shared in the upcoming Jina CLIP V1 technical report.

## Contact

Join our Discord community and chat with other community members about ideas.

## Citation

If you find Jina CLIP useful in your research, please cite the following paper:

TBD