A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions Paper • 2312.08578 • Published Dec 14, 2023
Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts Paper • 2309.04354 • Published Sep 8, 2023