Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity • Paper • 2407.10387 • Published Jul 15, 2024 • 6
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models • Paper • 2411.04996 • Published Nov 7, 2024 • 50
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos • Paper • 2501.04001 • Published 4 days ago • 34