EVEv2: Improved Baselines for Encoder-Free Vision-Language Models • arXiv:2502.06788 • Feb 2025
Scaling Pre-training to One Hundred Billion Data for Vision Language Models • arXiv:2502.07617 • Feb 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding? • arXiv:2502.05173 • Feb 2025
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features • arXiv:2502.14786 • Feb 2025
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token • arXiv:2501.03895 • Jan 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model • arXiv:2501.12368 • Jan 2025
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding • arXiv:2501.13106 • Jan 2025
Token-Efficient Long Video Understanding for Multimodal LLMs • arXiv:2503.04130 • Mar 2025