visual embedding and text embedding projection

by MonoLeon - opened 2 days ago

2 days ago

Hi, I want to use Oryx-ViT as a vision encoder for my videos and run a test search on them (video search using text). How can I align the video and text embeddings in the same latent space?

THUdyh

Owner 1 day ago

Hi, we trained Oryx-ViT to serve as a strong visual encoder for multimodal learning. However, even though we started from SigLIP, I do not think the final version of Oryx-ViT still has comparable performance at connecting text and images.
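For illustration only: searching videos with text requires both encoders to produce vectors in one shared space, which a jointly trained pair such as SigLIP's image and text towers is designed to provide. Below is a minimal sketch of the retrieval step under that assumption. The helpers `l2_normalize`, `mean_pool`, and `rank_videos` and the toy 3-dimensional vectors are hypothetical stand-ins, not part of Oryx or SigLIP; real embeddings would come from a matched pair of encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into a single video-level embedding."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]


def rank_videos(text_embedding, videos):
    """Return video ids sorted by cosine similarity to the text query, best first."""
    query = l2_normalize(text_embedding)
    scores = {}
    for video_id, frames in videos.items():
        video_vec = l2_normalize(mean_pool(frames))
        scores[video_id] = sum(a * b for a, b in zip(query, video_vec))
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (assumption): two frames per video, 3-dimensional embeddings.
videos = {
    "cat_clip": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]],
    "car_clip": [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]],
}
query = [1.0, 0.0, 0.0]  # stand-in for the text embedding of a query
ranking = rank_videos(query, videos)
```

Mean pooling over frames is only one simple way to get a video-level vector. The ranking is meaningful only if the text and visual embeddings were trained to be comparable, which is the property the reply above doubts Oryx-ViT retains.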

MonoLeon

about 20 hours ago

Thanks for the clarification!

MonoLeon changed discussion status to closed about 20 hours ago
