Add readme
README.md
ADDED
@@ -0,0 +1,81 @@
---
language: multilingual
---

## Multilingual-clip: XLM-Roberta-Large-Vit-B-16Plus

Multilingual-CLIP extends OpenAI's English text encoders to multiple other languages. This model contains *only* the multilingual text encoder. The corresponding image model `Vit-B-16Plus` can be retrieved by following the instructions in the `mlfoundations` [open_clip repository on GitHub](https://github.com/mlfoundations/open_clip). We provide a usage example below.

## Requirements

To use both the multilingual text encoder and the corresponding image encoder, install the packages [`multilingual-clip`](https://github.com/FreddeFrallan/Multilingual-CLIP) and [`open_clip_torch`](https://github.com/mlfoundations/open_clip):

```
pip install multilingual-clip
pip install open_clip_torch
```

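
The pip package names differ from the import names used in the code (`multilingual_clip` and `open_clip`), so a quick check that both modules can be found may save some confusion. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is illustrative and not part of either package.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the two packages installed above
required = ["multilingual_clip", "open_clip"]
missing = missing_modules(required)
print("Missing modules:", missing or "none")
```

An empty result only shows that the modules are discoverable; it does not verify package versions or that model weights can be downloaded.
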
## Usage

Extracting embeddings from the text encoder can be done in the following way:

```python
from multilingual_clip import pt_multilingual_clip
import transformers

texts = [
    'Three blind horses listening to Mozart.',
    'Älgen är skogens konung!',
    'Wie leben Eisbären in der Antarktis?',
    'Вы знали, что все белые медведи левши?'
]
# Hugging Face Hub ID of this model (hosted under the M-CLIP organization)
model_name = 'M-CLIP/XLM-Roberta-Large-Vit-B-16Plus'

# Load Model & Tokenizer
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Encode all texts in one batch
embeddings = model.forward(texts, tokenizer)
print("Text features shape:", embeddings.shape)
```

Extracting embeddings from the corresponding image encoder:

```python
import torch
import open_clip
import requests
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Pass `pretrained` to load trained weights; without it the model is randomly initialized
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16-plus-240', pretrained="laion400m_e32")
model.to(device)

# Download an example image and preprocess it into a batch of size 1
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)

print("Image features shape:", image_features.shape)
```

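
Since the text and image encoders map into a shared embedding space, their outputs can be compared directly. The sketch below assumes cosine similarity as the comparison measure, which is the usual choice for CLIP-style models, and works on plain lists of floats. The helper names `cosine_similarity` and `rank_texts` are illustrative, and the vectors are toy values rather than real model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_texts(image_vector, text_vectors):
    """Return text indices sorted from most to least similar to the image."""
    scores = [cosine_similarity(image_vector, t) for t in text_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values)
image_vector = [1.0, 0.0, 0.0]
text_vectors = [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_texts(image_vector, text_vectors)
print(ranking)
```

With real outputs, the tensors from the two examples above could be converted first, for instance with `embeddings.detach().cpu().tolist()`. For large batches, a tensor-based computation on normalized embeddings would be more efficient than this pure-Python loop.
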
## Evaluation results

None of the M-CLIP models have been extensively evaluated, but testing them on text-to-image retrieval on the human-translated MS-COCO dataset gives the following **R@10** results:

| Name | En | De | Es | Fr | Zh | It | Pl | Ko | Ru | Tr | Jp |
| ---------------------------------- |:-----: |:-----: |:-----: |:-----: |:-----: |:-----: |:-----: |:-----: |:-----: |:-----: |:-----: |
| [OpenAI CLIP Vit-B/32](https://github.com/openai/CLIP) | 90.3 | - | - | - | - | - | - | - | - | - | - |
| [OpenAI CLIP Vit-L/14](https://github.com/openai/CLIP) | 91.8 | - | - | - | - | - | - | - | - | - | - |
| [OpenCLIP ViT-B-16+](https://github.com/mlfoundations/open_clip) | 94.3 | - | - | - | - | - | - | - | - | - | - |
| [LABSE Vit-L/14](https://huggingface.co/M-CLIP/LABSE-Vit-L-14) | 91.6 | 89.6 | 89.5 | 89.9 | 88.9 | 90.1 | 89.8 | 80.8 | 85.5 | 89.8 | 73.9 |
| [XLM-R Large Vit-B/32](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-32) | 91.8 | 88.7 | 89.1 | 89.4 | 89.3 | 89.8 | 91.4 | 82.1 | 86.1 | 88.8 | 81.0 |
| [XLM-R Large Vit-L/14](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-L-14) | 92.4 | 90.6 | 91.0 | 90.0 | 89.7 | 91.1 | 91.3 | 85.2 | 85.8 | 90.3 | 81.9 |
| [XLM-R Large Vit-B/16+](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-16Plus) | **95.0** | **93.0** | **93.6** | **93.1** | **94.0** | **93.1** | **94.4** | **89.0** | **90.0** | **93.0** | **84.2** |

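
For readers unfamiliar with the metric, R@10 (recall at 10) is the share of text queries for which the correct image appears among the 10 highest-scoring images. The exact evaluation protocol is not described here, so the following is only a sketch under a simplifying assumption: one caption per image, with caption `i` paired with image `i`. The function name `recall_at_k` and the scores are illustrative.

```python
def recall_at_k(similarity, k):
    """Fraction of captions whose matching image is among the k highest-scoring images.

    similarity[i][j] is the score between caption i and image j.
    Assumes caption i is paired with image i (one caption per image).
    """
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (hypothetical scores, not model outputs)
similarity = [
    [0.9, 0.1, 0.3],  # caption 0: correct image ranked 1st
    [0.8, 0.2, 0.5],  # caption 1: correct image ranked 3rd
    [0.4, 0.6, 0.5],  # caption 2: correct image ranked 2nd
]
print(recall_at_k(similarity, 1))
print(recall_at_k(similarity, 2))
```

In this toy example one of three captions retrieves its image at rank 1 and two of three within the top 2, so the recall grows as `k` increases.
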
## Training/Model details

Further details about the model training and data can be found in the [model card](https://github.com/FreddeFrallan/Multilingual-CLIP/blob/main/larger_mclip.md).