---
license: apple-ascl
---
# MobileCLIP CoreML Models
These are the CoreML models of MobileCLIP. For more details, refer to MobileCLIP on HuggingFace and MobileCLIP on GitHub.
The models are provided separately for each subarchitecture:
- MobileCLIP-S0: This subarchitecture is designed for lightweight and fast inference, making it suitable for edge devices with limited computational resources.
- MobileCLIP-S1: This subarchitecture offers a balance between model complexity and performance, providing a good trade-off for various applications.
- MobileCLIP-S2: This subarchitecture focuses on achieving higher accuracy, ideal for applications where some inference speed can be traded for better results.
- MobileCLIP-B: This subarchitecture aims at delivering the highest possible accuracy, optimized for environments with ample computational resources.
Each subarchitecture provides a TextEncoder and an ImageEncoder as separate CoreML models:
| Model | CLIP Text | CLIP Image |
|---|---|---|
| MobileCLIP-S0 | clip_text_s0.mlpackage | clip_image_s0.mlpackage |
| MobileCLIP-S1 | clip_text_s1.mlpackage | clip_image_s1.mlpackage |
| MobileCLIP-S2 | clip_text_s2.mlpackage | clip_image_s2.mlpackage |
| MobileCLIP-B | clip_text_B.mlpackage | clip_image_B.mlpackage |
For detailed implementation and architecture specifics, refer to the MobileCLIP GitHub repository.
## CoreML Parameters

| Model | Input Name | Input Shape | Input DataType | Output Name | Output Shape | Output DataType |
|---|---|---|---|---|---|---|
| CLIP Text | input_text | (1,77) | INT32 | output_embeddings | (1,512) | FLOAT16 |

| Model | Input Name | Input Width | Input Height | Input ColorSpace | Output Name | Output Shape | Output DataType |
|---|---|---|---|---|---|---|---|
| CLIP Image | input_image | 256 | 256 | RGB | output_embeddings | (1,512) | FLOAT16 |
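Given these parameters, a minimal Python inference sketch with coremltools could look like the following; the package file names, the placeholder token ids, and the example image path are assumptions, and tokenization with the CLIP tokenizer is not shown:

```python
import coremltools as ct
import numpy as np
from PIL import Image

# Load the S0 encoders (file names are assumptions; adjust to your local paths).
text_model = ct.models.MLModel("clip_text_s0.mlpackage")
image_model = ct.models.MLModel("clip_image_s0.mlpackage")

# Text input: a (1, 77) INT32 tensor of token ids from the CLIP tokenizer
# (tokenization not shown; zeros are used here as a placeholder).
token_ids = np.zeros((1, 77), dtype=np.int32)
text_embedding = text_model.predict({"input_text": token_ids})["output_embeddings"]

# Image input: a 256x256 RGB image, passed directly as a PIL image.
image = Image.open("example.jpg").convert("RGB").resize((256, 256))
image_embedding = image_model.predict({"input_image": image})["output_embeddings"]

# Cosine similarity between the two (1, 512) embeddings.
a = np.asarray(text_embedding, dtype=np.float32)
b = np.asarray(image_embedding, dtype=np.float32)
similarity = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)
```

Note that coremltools prediction requires macOS; on-device inference through the CoreML framework uses the same input and output names.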
Below is an example of how the conversion to CoreML can be performed.
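This is a minimal conversion sketch for the S0 encoders, assuming the mobileclip package from the Apple ml-mobileclip repository (with its create_model_and_transforms, encode_text, and encode_image APIs), a downloaded PyTorch checkpoint, and coremltools; the wrapper modules, checkpoint path, and preprocessing scale are assumptions, not the exact scripts used to produce these packages:

```python
import coremltools as ct
import numpy as np
import torch
import mobileclip

# Load the PyTorch MobileCLIP-S0 model (checkpoint path is an assumption).
model, _, _ = mobileclip.create_model_and_transforms(
    "mobileclip_s0", pretrained="checkpoints/mobileclip_s0.pt"
)
model.eval()


# Thin wrappers so each encoder traces to a single-input, single-output module.
class TextEncoder(torch.nn.Module):
    def __init__(self, clip):
        super().__init__()
        self.clip = clip

    def forward(self, input_text):
        return self.clip.encode_text(input_text)


class ImageEncoder(torch.nn.Module):
    def __init__(self, clip):
        super().__init__()
        self.clip = clip

    def forward(self, input_image):
        return self.clip.encode_image(input_image)


# Text encoder: (1, 77) INT32 token ids in, (1, 512) embeddings out.
example_text = torch.zeros((1, 77), dtype=torch.int32)
traced_text = torch.jit.trace(TextEncoder(model), example_text)
text_mlmodel = ct.convert(
    traced_text,
    inputs=[ct.TensorType(name="input_text", shape=(1, 77), dtype=np.int32)],
    outputs=[ct.TensorType(name="output_embeddings")],
    convert_to="mlprogram",
)
text_mlmodel.save("clip_text_s0.mlpackage")

# Image encoder: 256x256 RGB image in, (1, 512) embeddings out.
# The 1/255 scale assumes the model expects pixel values in [0, 1].
example_image = torch.rand((1, 3, 256, 256))
traced_image = torch.jit.trace(ImageEncoder(model), example_image)
image_mlmodel = ct.convert(
    traced_image,
    inputs=[ct.ImageType(name="input_image", shape=(1, 3, 256, 256), scale=1 / 255.0)],
    outputs=[ct.TensorType(name="output_embeddings")],
    convert_to="mlprogram",
)
image_mlmodel.save("clip_image_s0.mlpackage")
```

The same pattern applies to the S1, S2, and B checkpoints by swapping the model name, checkpoint path, and output package names.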