language:
  - multilingual
  - af
  - sq
  - am
  - ar
  - az
  - bn
  - bs
  - bg
  - ca
  - zh
  - hr
  - cs
  - da
  - nl
  - en
  - et
  - fr
  - de
  - el
  - hi
  - hu
  - is
  - id
  - it
  - ja
  - mk
  - ml
  - mr
  - pl
  - pt
  - ro
  - ru
  - sr
  - sl
  - es
  - sw
  - sv
  - tl
  - te
  - tr
  - tk
  - uk
  - ur
  - ug
  - uz
  - vi
  - xh

Multilingual-CLIP: XLM-Roberta-Large-Vit-B-32

Multilingual-CLIP extends OpenAI's English text encoders to multiple other languages. This model contains only the multilingual text encoder. The corresponding image model, ViT-B-32, can be retrieved by following the instructions in OpenAI's CLIP repository on GitHub. We provide a usage example below.

Requirements

To use both the multilingual text encoder and corresponding image encoder, we need to install the packages multilingual-clip and clip.

pip install multilingual-clip
pip install git+https://github.com/openai/CLIP.git

Usage

Extracting embeddings from the text encoder can be done in the following way:

from multilingual_clip import pt_multilingual_clip
import transformers

texts = [
    'Three blind horses listening to Mozart.',
    'Älgen är skogens konung!',
    'Wie leben Eisbären in der Antarktis?',
    'Вы знали, что все белые медведи левши?'
]
model_name = 'M-CLIP/XLM-Roberta-Large-Vit-B-32'

# Load Model & Tokenizer
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

embeddings = model.forward(texts, tokenizer)
print("Text features shape:", embeddings.shape)

Extracting embeddings from the corresponding image encoder:

import torch
import clip
import requests
from PIL import Image

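# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
# Note: this reuses the name `model`, so it replaces the text encoder if both snippets run in one session
model, preprocess = clip.load("ViT-B/32", device=device)

# Download a sample image and convert it into a batch of one preprocessed tensor
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    image_features = model.encode_image(image)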

print("Image features shape:", image_features.shape)
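Once both kinds of embeddings are available, a common next step is to rank the texts by how well they match an image. The following is a minimal, dependency-free sketch of that step. It assumes the text and image embeddings have the same dimensionality and are meant to be compared by cosine similarity, as is usual for CLIP-style models. The helper names cosine_similarity and rank_texts are illustrative and not part of multilingual-clip or clip, and the toy vectors stand in for real embeddings (which could be converted to lists with .tolist()).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_texts(image_vec, text_vecs):
    """Return (text_index, score) pairs sorted from best to worst match."""
    scores = [cosine_similarity(image_vec, t) for t in text_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values)
image_vec = [1.0, 0.0, 0.0]
text_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the image vector
    [2.0, 0.0, 0.0],  # same direction as the image vector
    [1.0, 1.0, 0.0],  # 45 degrees away from the image vector
]

ranking = rank_texts(image_vec, text_vecs)
best_index, best_score = ranking[0]
print("Best matching text index:", best_index, "score:", round(best_score, 3))
```

With real embeddings, image_vec would be one row of image_features and text_vecs the rows of embeddings from the snippets above.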

Evaluation results

None of the M-CLIP models have been extensively evaluated, but testing them on text-to-image (Txt2Img) retrieval on the human-translated MS-COCO dataset, we see the following R@10 results:

Name	En	De	Es	Fr	Zh	It	Pl	Ko	Ru	Tr	Jp
OpenAI CLIP Vit-B/32	90.3	-	-	-	-	-	-	-	-	-	-
OpenAI CLIP Vit-L/14	91.8	-	-	-	-	-	-	-	-	-	-
OpenCLIP ViT-B-16+-	94.3	-	-	-	-	-	-	-	-	-	-
LABSE Vit-L/14	91.6	89.6	89.5	89.9	88.9	90.1	89.8	80.8	85.5	89.8	73.9
XLM-R Large Vit-B/32	91.8	88.7	89.1	89.4	89.3	89.8	91.4	82.1	86.1	88.8	81.0
XLM-R Vit-L/14	92.4	90.6	91.0	90.0	89.7	91.1	91.3	85.2	85.8	90.3	81.9
XLM-R Large Vit-B/16+	95.0	93.0	93.6	93.1	94.0	93.1	94.4	89.0	90.0	93.0	84.2
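For readers unfamiliar with the metric, R@10 (recall at 10) is commonly defined as the percentage of queries whose correct item appears among the top 10 retrieved results. The sketch below illustrates that common definition under the assumption that each caption has exactly one correct image. The exact evaluation protocol behind the table is not described here, so this is not necessarily how the numbers above were produced. The function name recall_at_k and the toy data are hypothetical.

```python
def recall_at_k(ranked_ids_per_query, true_ids, k=10):
    """Percentage of queries whose true item is within the top-k ranked items.

    ranked_ids_per_query: one list of item ids per query, best match first.
    true_ids: the single correct item id for each query (an assumption of this sketch).
    """
    hits = sum(
        1
        for ranked, true_id in zip(ranked_ids_per_query, true_ids)
        if true_id in ranked[:k]
    )
    return 100.0 * hits / len(true_ids)


# Toy example: four caption queries, each ranking three candidate image ids
ranked = [
    [3, 1, 2],
    [2, 3, 1],
    [1, 2, 3],
    [3, 2, 1],
]
truth = [3, 1, 1, 2]

r_at_1 = recall_at_k(ranked, truth, k=1)
r_at_2 = recall_at_k(ranked, truth, k=2)
print("R@1:", r_at_1, "R@2:", r_at_2)
```

In a retrieval setup like the one described, each ranked list would come from sorting images by similarity to a caption embedding.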

Training/Model details

Further details about the model training and data can be found in the model card.