---
license: apple-ascl
---

## MobileCLIP CoreML Models

The models in this repository are CoreML conversions of the original MobileCLIP models from Apple. For more details, refer to [MobileCLIP on Hugging Face](https://huggingface.co./apple/mobileclip_b_timm) and [MobileCLIP on GitHub](https://github.com/apple/ml-mobileclip).

Separate models are provided for each subarchitecture:

- **MobileCLIP-S0**: Designed for lightweight, fast inference, making it suitable for edge devices with limited computational resources.
- **MobileCLIP-S1**: Offers a balance between model complexity and performance, a good trade-off for a wide range of applications.
- **MobileCLIP-S2**: Targets higher accuracy, ideal for applications where some inference speed can be traded for better results.
- **MobileCLIP-B**: Delivers the highest accuracy of the family, intended for environments with ample computational resources.

Each subarchitecture consists of a text encoder and an image encoder, packaged as separate CoreML models:

| Model                                                     | CLIP Text                | CLIP Image                  |
|:----------------------------------------------------------|:-------------------------|:----------------------------|
| MobileCLIP-S0                                             | clip_text_s0.mlpackage   | clip_image_s0.mlpackage     |
| MobileCLIP-S1                                             | clip_text_s1.mlpackage   | clip_image_s1.mlpackage     |
| MobileCLIP-S2                                             | clip_text_s2.mlpackage   | clip_image_s2.mlpackage     |
| MobileCLIP-B                                              | clip_text_B.mlpackage    | clip_image_B.mlpackage      |

For detailed implementation and architecture specifics, refer to the [MobileCLIP GitHub repository](https://github.com/apple/ml-mobileclip).

## Example Usage

An example of using these CoreML models in a Swift application for iOS can be found in the [CLIP-Finder](https://github.com/fguzman82/CLIP-Finder2) project.
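
As a rough, self-contained sketch of what such an integration can look like (independent of CLIP-Finder, and assuming the `.mlpackage` has been added to an Xcode target, which compiles it into a `.mlmodelc` bundle resource), the image encoder can be driven directly through the `MLModel` API. Feature names and shapes follow the parameter tables below; the 256x256 pixel buffer preparation is left to the caller:

```swift
import CoreML
import CoreVideo
import Foundation

/// Computes a (1, 512) image embedding with the MobileCLIP-S0 image encoder.
/// `pixelBuffer` is assumed to be a 256x256 RGB CVPixelBuffer prepared elsewhere.
func imageEmbedding(for pixelBuffer: CVPixelBuffer) throws -> MLMultiArray {
    let config = MLModelConfiguration()
    config.computeUnits = .all  // let Core ML choose between CPU, GPU, and ANE

    // Xcode compiles clip_image_s0.mlpackage into clip_image_s0.mlmodelc inside the app bundle.
    let url = Bundle.main.url(forResource: "clip_image_s0", withExtension: "mlmodelc")!
    let model = try MLModel(contentsOf: url, configuration: config)

    // Input and output names as listed in the parameter tables below.
    let input = try MLDictionaryFeatureProvider(dictionary: [
        "input_image": MLFeatureValue(pixelBuffer: pixelBuffer)
    ])
    let output = try model.prediction(from: input)

    return output.featureValue(for: "output_embeddings")!.multiArrayValue!
}
```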


**CoreML Parameters:**


| Model    | Input Name   | Input Shape | Input DataType | Output Name        | Output Shape | Output DataType |
|:---------|:-------------|:------------|:---------------|:-------------------|:-------------|:----------------|
| CLIP Text| input_text   | (1,77)      | INT32          | output_embeddings  | (1,512)      | FLOAT16         |

| Model    | Input Name   | Input Width | Input Height | Input ColorSpace | Output Name        | Output Shape | Output DataType |
|:---------|:-------------|:------------|:-------------|:-----------------|:-------------------|:-------------|:----------------|
| CLIP Image| input_image | 256         | 256          | RGB              | output_embeddings  | (1,512)      | FLOAT16         |
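
For the text side, a minimal sketch matching the table above looks like the following; the CLIP BPE tokenization itself (producing 77 padded token ids) is assumed to happen elsewhere:

```swift
import CoreML
import Foundation

/// Encodes a padded 77-token CLIP BPE sequence into a (1, 512) embedding.
/// `tokenIDs` is assumed to already be tokenized and padded/truncated to length 77.
func textEmbedding(tokenIDs: [Int32], using textEncoder: MLModel) throws -> MLMultiArray {
    precondition(tokenIDs.count == 77, "the text encoder expects a (1, 77) INT32 input")

    // Build the (1, 77) INT32 input_text tensor described above.
    let tokens = try MLMultiArray(shape: [1, 77], dataType: .int32)
    for (i, id) in tokenIDs.enumerated() {
        tokens[[0, i] as [NSNumber]] = NSNumber(value: id)
    }

    let input = try MLDictionaryFeatureProvider(dictionary: [
        "input_text": MLFeatureValue(multiArray: tokens)
    ])
    let output = try textEncoder.prediction(from: input)

    // output_embeddings: (1, 512), FLOAT16.
    return output.featureValue(for: "output_embeddings")!.multiArrayValue!
}
```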

## CoreML Profile (Benchmark) on Apple M1

| Prediction Times Apple M1 | CPU + ANE | CPU + GPU | CPU Only |
|:--------------------------|:---------:|:---------:|:--------:|
| clip_image_s0             | 1.4ms     | 7.4ms     | 12.7ms   |
| clip_image_s1             | 2.1ms     | 13.3ms    | 21.8ms   |
| clip_image_s2             | 3.0ms     | 19.0ms    | 28.5ms   |
| clip_image_b              | 12.4ms    | 36.2ms    | 38.1ms   |
|                           |           |           |          |
| clip_text_s0              | 1.1ms     | 4.1ms     | 4.8ms    |
| clip_text_s1              | 2.0ms     | 7.1ms     | 9.5ms    |
| clip_text_s2              | 2.0ms     | 7.1ms     | 10ms     |
| clip_text_b               | 2.0ms     | 7.2ms     | 9.8ms    |

The profile was conducted using this tool: [CoreMLProfiler](https://github.com/fguzman82/CoreMLProfiler).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6427456313e5e91b24284353/rquK-eeZr5BmcG5b1LU6J.png)
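
The three columns correspond to the `computeUnits` preference set when loading a model. A minimal sketch (not tied to the profiler tool) of pinning that preference:

```swift
import CoreML
import Foundation

/// Loads a MobileCLIP CoreML encoder with an explicit compute-unit preference.
/// The benchmark columns above correspond to `.cpuAndNeuralEngine` ("CPU + ANE"),
/// `.cpuAndGPU` ("CPU + GPU"), and `.cpuOnly` ("CPU Only").
func loadEncoder(named name: String, computeUnits: MLComputeUnits) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = computeUnits
    let url = Bundle.main.url(forResource: name, withExtension: "mlmodelc")!
    return try MLModel(contentsOf: url, configuration: config)
}

// e.g. let encoder = try loadEncoder(named: "clip_image_s0", computeUnits: .cpuAndNeuralEngine)
```

Note that Core ML treats this setting as a preference, not a guarantee: layers that cannot run on the requested unit fall back to the CPU.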

*These example notebooks show how to perform the conversion to CoreML:*

1. **CLIPImageModel to CoreML** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZHMzsJyAukBa4Jryv4Tmc_BOBmbQAjxf?usp=sharing)
   - This notebook demonstrates the process of converting a CLIP image model to CoreML format.

2. **CLIPTextModel to CoreML**  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PxzB8M0h2bf-uYpw7fIZImpGSVXUI7Ie?usp=sharing)
   - This notebook demonstrates the process of converting a CLIP text model to CoreML format.