Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published 1 day ago • 47
Single-Layer Learnable Activation for Implicit Neural Representation (SL^{2}A-INR) Paper • 2409.10836 • Published 3 days ago • 4
SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction Paper • 2409.11211 • Published 3 days ago • 6
Robust Dual Gaussian Splatting for Immersive Human-centric Volumetric Videos Paper • 2409.08353 • Published 7 days ago • 9
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture Paper • 2409.02889 • Published 16 days ago • 53
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges Paper • 2409.01071 • Published 18 days ago • 26
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders Paper • 2408.15998 • Published 22 days ago • 81
FLoD: Integrating Flexible Level of Detail into 3D Gaussian Splatting for Customizable Rendering Paper • 2408.12894 • Published 28 days ago • 3
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Paper • 2408.11039 • Published about 1 month ago • 54
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views Paper • 2408.10195 • Published Aug 19 • 12
LongVILA: Scaling Long-Context Visual Language Models for Long Videos Paper • 2408.10188 • Published Aug 19 • 51
Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning Paper • 2408.07931 • Published Aug 15 • 17
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 96
ControlNeXt: Powerful and Efficient Control for Image and Video Generation Paper • 2408.06070 • Published Aug 12 • 52
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI Paper • 2408.03361 • Published Aug 6 • 85
Compact 3D Gaussian Splatting for Static and Dynamic Radiance Fields Paper • 2408.03822 • Published Aug 7 • 9
RayGauss: Volumetric Gaussian-Based Ray Casting for Photorealistic Novel View Synthesis Paper • 2408.03356 • Published Aug 6 • 8
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine Paper • 2408.02900 • Published Aug 6 • 25
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models Paper • 2408.02085 • Published Aug 4 • 17
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts Paper • 2407.21770 • Published Jul 31 • 20
Tora: Trajectory-oriented Diffusion Transformer for Video Generation Paper • 2407.21705 • Published Jul 31 • 25
Theia: Distilling Diverse Vision Foundation Models for Robot Learning Paper • 2407.20179 • Published Jul 29 • 45
SHIC: Shape-Image Correspondences with no Keypoint Supervision Paper • 2407.18907 • Published Jul 26 • 38
BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation Paper • 2407.17952 • Published Jul 25 • 27
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person Paper • 2407.16224 • Published Jul 23 • 23
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model Paper • 2407.16198 • Published Jul 23 • 13
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding Paper • 2407.15754 • Published Jul 22 • 19
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models Paper • 2407.15841 • Published Jul 22 • 38
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models Paper • 2407.12772 • Published Jul 17 • 32
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models Paper • 2407.07895 • Published Jul 10 • 40
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions Paper • 2407.06723 • Published Jul 9 • 10
VIMI: Grounding Video Generation through Multi-modal Instruction Paper • 2407.06304 • Published Jul 8 • 9
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions Paper • 2407.06358 • Published Jul 8 • 17
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models Paper • 2407.06938 • Published Jul 9 • 21
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion Paper • 2407.01392 • Published Jul 1 • 39
Consistency Flow Matching: Defining Straight Flows with Velocity Consistency Paper • 2407.02398 • Published Jul 2 • 14
Wavelets Are All You Need for Autoregressive Image Generation Paper • 2406.19997 • Published Jun 28 • 28
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model Paper • 2406.20076 • Published Jun 28 • 8
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale Paper • 2406.19280 • Published Jun 27 • 59
Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps Paper • 2406.14539 • Published Jun 20 • 26
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding Paper • 2406.14515 • Published Jun 20 • 32
HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors Paper • 2406.12459 • Published Jun 18 • 11
VideoLLM-online: Online Video Large Language Model for Streaming Video Paper • 2406.11816 • Published Jun 17 • 20
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling Paper • 2406.07522 • Published Jun 11 • 36
Hibou: A Family of Foundational Vision Transformers for Pathology Paper • 2406.05074 • Published Jun 7 • 6