license: apache-2.0
Chinese-CLIP-Base
Introduction
This is the base version of Chinese CLIP. Chinese CLIP is a simple implementation of CLIP trained on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report (https://arxiv.org/abs/2211.01335) and our official GitHub repo (https://github.com/OFA-Sys/Chinese-CLIP).
How to use
We provide a simple code snippet below to show how to use the Chinese-CLIP API. To get started, first install cn_clip:
```bash
# to install the latest stable release
pip install cn_clip

# or install from source code
cd Chinese-CLIP
pip install -e .
```
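Optionally, you can sanity-check the install by listing the pretrained model names bundled with the package (this is the same available_models helper used in the snippet below):

```python
# Quick check that cn_clip imports correctly and can list its pretrained models
import cn_clip.clip as clip
print(clip.available_models())
```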
After installation, use Chinese CLIP as shown below:
```python
import torch
from PIL import Image

import cn_clip.clip as clip
from cn_clip.clip import load_from_name, available_models

print("Available models:", available_models())
# Available models: ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize the features. Please use the normalized features for downstream tasks.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    logits_per_image, logits_per_text = model.get_similarity(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # [[1.268734e-03 5.436878e-02 6.795761e-04 9.436829e-01]]
```
If the API does not cover your use case, feel free to check our GitHub repo https://github.com/OFA-Sys/Chinese-CLIP for more details about training and inference.
Results
MUGE Text-to-Image Retrieval
| Setup | Zero-shot | | | | Finetune | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Metric | R@1 | R@5 | R@10 | MR | R@1 | R@5 | R@10 | MR |
| Wukong<sub>ViT-B</sub> | 33.4 | 59.3 | 69.7 | 54.1 | 39.2 | 66.9 | 77.4 | 61.2 |
| R2D2<sub>ViT-B</sub> | - | - | - | - | 47.4 | 75.1 | 83.5 | 68.7 |
| CN-CLIP<sub>ViT-B</sub> | 52.1 | 76.7 | 84.4 | 71.1 | 58.4 | 83.6 | 90.0 | 77.4 |
Flickr30K-CN Retrieval
| Task | Text-to-Image | | | | | | Image-to-Text | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Setup | Zero-shot | | | Finetune | | | Zero-shot | | | Finetune | | |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| Wukong<sub>ViT-B</sub> | 45.7 | 73.8 | 82.2 | 67.6 | 89.6 | 94.2 | 66.2 | 88.7 | 94.3 | 83.9 | 97.6 | 99.0 |
| R2D2<sub>ViT-B</sub> | - | - | - | 78.3 | 94.6 | 97.0 | - | - | - | 92.6 | 99.1 | 99.8 |
| CN-CLIP<sub>ViT-B</sub> | 62.7 | 86.9 | 92.8 | 79.1 | 94.8 | 97.4 | 74.6 | 93.5 | 97.1 | 93.5 | 99.0 | 99.5 |
COCO-CN Retrieval
| Task | Text-to-Image | | | | | | Image-to-Text | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Setup | Zero-shot | | | Finetune | | | Zero-shot | | | Finetune | | |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| Wukong<sub>ViT-B</sub> | 49.2 | 79.4 | 87.9 | 67.0 | 91.4 | 96.7 | 48.3 | 77.8 | 88.8 | 65.8 | 90.3 | 96.6 |
| R2D2<sub>ViT-B</sub> | - | - | - | 75.1 | 94.2 | 98.1 | - | - | - | 76.1 | 95.3 | 98.5 |
| CN-CLIP<sub>ViT-B</sub> | 62.2 | 86.6 | 94.9 | 77.0 | 97.1 | 99.0 | 57.0 | 84.1 | 93.6 | 77.4 | 96.2 | 98.9 |
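For the full training and evaluation pipelines behind these numbers, please refer to the GitHub repo. As a rough, self-contained sketch of the text-to-image direction using only the API shown above, the snippet below ranks a small local image gallery for a single Chinese query (the gallery paths and the query string are hypothetical placeholders):

```python
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()

# Hypothetical local gallery; replace with your own image paths.
image_paths = ["gallery/cat.jpg", "gallery/dog.jpg", "gallery/bike.jpg"]
query = "一只橘色的猫"  # "an orange cat"

images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
text = clip.tokenize([query]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    # Normalize, then score each gallery image against the query by cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    scores = (text_features @ image_features.T).squeeze(0)

# Print the gallery ranked from most to least similar to the query.
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. {image_paths[idx]} (score={scores[idx].item():.4f})")
```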
Citation
If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!
```bibtex
@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}
```