visual embedding and text embedding projection

#1
by MonoLeon - opened

Hi, I want to use Oryx-ViT as a vision encoder for my videos and run a test search on them (video search using text). How can I align the video and text embeddings in the same latent space?

Hi, we trained Oryx-ViT to serve as a strong visual encoder for multimodal learning. However, even though we started from SigLIP, I do not think the final version of Oryx-ViT still has comparable performance at connecting text and images.
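
For reference, here is a minimal sketch of what text-to-video search looks like once video and text embeddings do share a latent space. It is not Oryx-ViT's API and it does not perform the alignment itself: it assumes you already have frame embeddings and a text embedding from a jointly trained pair of encoders (or from a projection you trained yourself). The function names `l2_normalize`, `mean_pool`, and `rank_videos` and the toy 3-dimensional vectors are hypothetical, chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video-level embedding."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]


def rank_videos(text_embedding, videos):
    """Return (video_name, cosine_similarity) pairs, best match first.

    Assumes text and frame embeddings already live in the same space.
    """
    query = l2_normalize(text_embedding)
    scores = []
    for name, frames in videos.items():
        video = l2_normalize(mean_pool(frames))
        scores.append((name, sum(q * v for q, v in zip(query, video))))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): two videos with two frame embeddings each.
videos = {
    "cat_clip": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]],
    "car_clip": [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]],
}
query = [1.0, 0.0, 0.0]  # stand-in for an embedded text query
ranking = rank_videos(query, videos)
print(ranking[0][0])
```

If the visual encoder has drifted away from its original text tower, as described above, these similarities will not be meaningful until the two sides are re-aligned, for example by training a projection on paired video-text data with a contrastive objective.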

Thanks for the clarification!

MonoLeon changed discussion status to closed