AskVideos-VideoCLIPv0.1

Like it's image-only counterpart, CLIP, VideoCLIP enables you to compute a single embedding for videos that can be used to compute similarity with text.

VideoCLIP uses a Video Q-Former to aggregate frame-level embeddings temporally into a single embedding, maintaining relevance of the underlying content. The resulting embedding is then trained with contrastive loss + captioning loss to match it's corresponding text.

Usage

Link to github to run the model: link.

# Load model.
import video_clip
eval_config = 'eval_configs/video_clip.yaml'
model, vis_processor = video_clip.load_model(eval_config)

# Compute video embeddings.
# video_embs: float matrix of size [num_videos, clip_dim_size, query_tokens] containing VideoCLIP embeddings.
# In this model, clip_dim_size=1024 and query_tokens=32.
video_embs = video_clip.get_all_video_embeddings(videos, model, vis_processor)

# Compute Video-Text similarity.
# v2t_sim: float matrix of size [num_videos, num_texts] indicating similarity.
v2t_sim = video_clip.compute_sim(model, texts, video_embs)

# Compute Text-Video similarity.
# t2v_sim: float matrix of size [num_texts, num_videos] indicating similarity.
t2v_sim = v2t_sim.T

# Compute Video-Video distance.
# v2v_dists: float vector of size [1, num_videos] indicating distance to query video embedding.
v2v_dists = video_clip.compute_dist_videoq(model, video_embs[0], video_embs)

For a more detailed demo of how to use the model, see the colab.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The model has no library tag.