visual embedding and text embedding projection
#1 by MonoLeon - opened
Hi, I want to use Oryx-ViT as a vision encoder for my videos and run a test search (video search using text) on them. How can I align video and text embeddings in the same latent space?
Hi, we trained Oryx-ViT to serve as a strong visual encoder for multimodal learning. However, even though we started from SigLIP, I do not think the final version of Oryx-ViT retains comparable performance at connecting text and images.
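For reference, with an encoder pair whose image and text towers are still aligned (for example an original SigLIP checkpoint rather than Oryx-ViT), text-to-video search usually comes down to pooling per-frame embeddings into one vector per video and ranking videos by cosine similarity to the query text embedding. Below is a minimal sketch of that ranking step only. It assumes the frame and text embeddings have already been computed by such a model; the function names `l2_normalize`, `video_embedding`, and `rank_videos` are hypothetical, and the vectors are toy values, not real model outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def video_embedding(frame_embeddings):
    """Mean-pool normalized per-frame embeddings into one unit-length video vector.

    Mean pooling is an assumption made for this sketch; other pooling
    strategies are possible.
    """
    frames = [l2_normalize(f) for f in frame_embeddings]
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return l2_normalize(mean)


def rank_videos(text_embedding, video_embeddings):
    """Return (video_index, cosine_similarity) pairs, best match first."""
    text = l2_normalize(text_embedding)
    scores = [
        sum(t * v for t, v in zip(text, l2_normalize(vid)))
        for vid in video_embeddings
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy data standing in for real embeddings (assumption: 3-dimensional vectors).
videos = [
    video_embedding([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]),  # video 0
    video_embedding([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]),  # video 1
]
query = [1.0, 0.0, 0.0]  # stands in for a text embedding
ranking = rank_videos(query, videos)
```

This only works when both embeddings live in the same space, which is the property that may no longer hold for Oryx-ViT after training.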
Thanks for the clarification!
MonoLeon changed discussion status to closed