---
license: mit
language:
- en
---
# CLIP ViT-L/14 in OpenVINO™ format

## Original model details

The CLIP model was developed by OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment; to deploy models like CLIP, researchers first need to carefully study their capabilities in relation to the specific context they're being deployed within.

## Model type

The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
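For intuition, here is a minimal NumPy sketch of that symmetric contrastive objective (illustrative only; the helper function and the temperature value are assumptions, not OpenAI's training code):

```python
import numpy as np
from scipy.special import logsumexp

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) embeddings.

    image_emb, text_emb: (batch, dim) arrays where row i of each is a matched pair.
    Illustrative sketch only; the temperature value here is an assumption.
    """
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; matched pairs sit on the diagonal
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Cross-entropy over rows (image -> text) and over columns (text -> image)
    loss_i2t = -(logits[idx, idx] - logsumexp(logits, axis=1)).mean()
    loss_t2i = -(logits[idx, idx] - logsumexp(logits, axis=0)).mean()
    return (loss_i2t + loss_t2i) / 2
```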

## OpenVINO™ optimization

To increase the efficiency of the model during inference, we utilized the OpenVINO™ toolkit for optimization. The table below showcases the inference-time improvements achieved with OpenVINO™ compared to the original PyTorch implementation:

| Metric | PyTorch Inference Time (sec) | OpenVINO™ Inference Time (sec) | Similarity |
|:-------|-----------------------------:|-------------------------------:|-----------:|
| mean   | 0.52                         | 0.46                           | 1          |
| std    | 0.11                         | 0.09                           | 0          |
| min    | 0.39                         | 0.36                           | 1          |
| max    | 0.70                         | 0.62                           | 1          |

OpenVINO™ offers a 1.12x speedup in inference time compared to PyTorch, measured on the same image over 100 iterations on an Intel(R) Xeon(R) CPU @ 2.20GHz (CPU family: 6, Model: 79).
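
The exact benchmark harness is not published here, but a sketch along these lines reproduces the statistics in the table (the `benchmark` helper is illustrative; `compiled_model` and `inputs` refer to the objects built in the Usage section below):

```python
import time

import numpy as np

def benchmark(run_once, iterations=100):
    """Time a zero-argument callable and return summary statistics in seconds."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    t = np.array(timings)
    return {"mean": t.mean(), "std": t.std(), "min": t.min(), "max": t.max()}

# Example, using the objects from the Usage section below:
# print(benchmark(lambda: compiled_model(dict(inputs))))
```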

The results indicate that the OpenVINO™ optimization provides a consistent improvement in inference time while maintaining the same level of accuracy (as indicated by the similarity score).
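
The metric behind the similarity column is not spelled out here; one common check is the cosine similarity between the logits produced by the original PyTorch model and by the OpenVINO™ model on the same input, where a value of 1 means the outputs match. A sketch, assuming that metric:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare PyTorch and OpenVINO logits computed on the same preprocessed input:
# cosine_similarity(torch_logits.detach().numpy(), ov_logits_per_image)
```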

## Usage

You can use this optimized model for faster inference in environments where time is a critical factor. First, install the necessary libraries and dependencies:

```bash
pip install transformers huggingface_hub openvino-dev scipy pillow torch
```

Then use it for inference:

```python
import os

import numpy as np
from PIL import Image
from huggingface_hub import snapshot_download
from openvino.runtime import Core
from scipy.special import softmax
from transformers import CLIPProcessor

# Download the OpenVINO model
ov_path = snapshot_download(repo_id="scaleflex/clip-vit-large-patch14-openvino")
# Load the preprocessor for the model input
processor = CLIPProcessor.from_pretrained("scaleflex/clip-vit-large-patch14-openvino")
ov_model_xml = os.path.join(ov_path, "clip-vit-large-patch14.xml")

image = Image.open("face.png")  # download this example image: http://sample.li/face.png
input_labels = [
    "businessman",
    "dog playing in the garden",
    "beautiful woman",
    "big city",
    "lake in the mountain",
]
text_descriptions = [f"This is a photo of a {label}" for label in input_labels]
inputs = processor(
    text=text_descriptions, images=[image], return_tensors="pt", padding=True
)

# Create an OpenVINO Core object instance
core = Core()

ov_model = core.read_model(model=ov_model_xml)
# Compile the model for loading on a device
compiled_model = core.compile_model(ov_model)
# Obtain the output tensor for getting predictions
logits_per_image_out = compiled_model.output(0)
# Run inference on the preprocessed data and get the image-text similarity scores
ov_logits_per_image = compiled_model(dict(inputs))[logits_per_image_out]
# Apply softmax to the scores
probs = softmax(ov_logits_per_image, axis=1)
max_index = np.argmax(probs)

# Use the index to get the corresponding label
label_with_max_prob = input_labels[max_index]
print(
    f"The label with the highest probability is: '{label_with_max_prob}' "
    f"with a probability of {probs[0][max_index] * 100:.2f}%"
)
# The label with the highest probability is: 'beautiful woman' with a probability of 97.87%
```
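
Calling `compile_model` without a device, as above, lets OpenVINO™ pick one automatically. To target a specific device, pass its name explicitly (which devices are available depends on your machine):

```python
# Compile for a specific device, e.g. "CPU", "GPU", or "AUTO"
compiled_model = core.compile_model(ov_model, device_name="CPU")
```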