SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published 8 days ago • 118
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization Paper • 2502.13922 • Published 9 days ago • 25
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking Paper • 2502.02339 • Published 24 days ago • 22
Ovis2 Collection Our latest advancement in multimodal large language models (MLLMs) • 8 items • Updated 12 days ago • 52
VideoLLaMA3 Collection Frontier Multimodal Foundation Models for Video Understanding • 14 items • Updated 22 days ago • 13
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published Jan 22 • 83
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21 • 83
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models Paper • 2501.03262 • Published Jan 4 • 90
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published Jan 1 • 99
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper • 2501.00599 • Published Dec 31, 2024 • 41
PixMo Collection A set of vision-language datasets built by Ai2 and used to train the Molmo family of models. Read more at https://molmo.allenai.org/blog • 9 items • Updated 18 days ago • 64
Inf-CL Collection The corresponding demos, checkpoints, papers, and datasets of Inf-CL. • 2 items • Updated Jan 24 • 3
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models Paper • 2410.23266 • Published Oct 30, 2024 • 20
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss Paper • 2410.17243 • Published Oct 22, 2024 • 89
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective Paper • 2410.12490 • Published Oct 16, 2024 • 8