DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception (arXiv:2407.08303, published Jul 11, 2024)
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (arXiv:2407.07053, published Jul 9, 2024)
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models (arXiv:2407.07895, published Jul 10, 2024)
EVLM: An Efficient Vision-Language Model for Visual Understanding (arXiv:2407.14177, published Jul 19, 2024)
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model (arXiv:2407.16198, published Jul 23, 2024)