CLIP Model based on DistilBERT and ViT

This repository contains a CLIP (Contrastive Language-Image Pretraining) model that combines the power of two state-of-the-art architectures:

  • DistilBERT (based on distilbert-base-uncased): A smaller, faster, and lighter version of BERT.
  • Vision Transformer (ViT) (based on google/vit-base-patch16-224): A powerful vision transformer architecture for image processing.

The model is trained to learn joint representations of images and text, enabling a variety of multimodal tasks such as image-text matching, zero-shot classification, and cross-modal retrieval.
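For instance, zero-shot classification reduces to a nearest-neighbour search in the shared embedding space: candidate labels are encoded as text, the image is encoded once, and the label with the highest cosine similarity wins. The sketch below is purely illustrative; the label prompts, embedding dimension, and random placeholder embeddings stand in for real encoder outputs.

```python
import torch
import torch.nn.functional as F

# Illustrative placeholders: in practice these come from the text and image encoders.
num_labels, dim = 3, 256
text_emb = F.normalize(torch.randn(num_labels, dim), dim=-1)  # one row per candidate label
image_emb = F.normalize(torch.randn(1, dim), dim=-1)          # embedding of the query image

labels = ["a photo of a dress", "a photo of a shoe", "a photo of a handbag"]
scores = (image_emb @ text_emb.t()).squeeze(0)                # cosine similarities
print("predicted label:", labels[scores.argmax().item()])
```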

Model Overview

CLIP combines a text encoder and an image encoder to map both images and texts into a shared embedding space. Because it is trained on a large number of image-text pairs, the model can perform a variety of downstream tasks without task-specific fine-tuning.
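Training typically uses a CLIP-style contrastive objective: within a batch, each image should be most similar to its own caption and each caption to its own image. Below is a minimal sketch, assuming batches of L2-normalized text and image embeddings; the temperature value is an assumption, not the setting used in this repository.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb: torch.Tensor,
                          image_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Pairwise cosine similarities: row i compares image i with every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: images must match their captions and captions their images.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```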

Components:

  • Text Encoder: distilbert-base-uncased is used to encode the textual input into a dense vector.
  • Image Encoder: google/vit-base-patch16-224 processes image data by dividing images into patches and learning their contextual relationships.
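A minimal sketch of how the two encoders can be combined into a dual-encoder model, with linear projection heads mapping both into the shared embedding space. The projection dimension and pooling choices are assumptions for illustration, not the exact layout of this repository's model.

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, ViTModel

class CLIPDualEncoder(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
        # Linear heads project both encoders into the shared embedding space.
        self.text_proj = nn.Linear(self.text_encoder.config.dim, embed_dim)
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]   # first ([CLS]-position) token
        return F.normalize(self.text_proj(pooled), dim=-1)

    def encode_image(self, pixel_values):
        out = self.image_encoder(pixel_values=pixel_values)
        pooled = out.last_hidden_state[:, 0]   # ViT [CLS] token
        return F.normalize(self.image_proj(pooled), dim=-1)
```

Text inputs are tokenized with the matching DistilBERT tokenizer and images are resized and normalized with the matching ViT image processor from transformers before being passed to the encoders.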

Future work:

Train on larger datasets and with more compute resources.
