Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models • Paper • 2410.02740 • Published 14 days ago • 51
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions • Paper • 2410.10816 • Published 3 days ago • 15