Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities Paper • arXiv:2401.14405 • Published Jan 25, 2024 • 12 upvotes
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Paper • arXiv:2406.18521 • Published Jun 26, 2024 • 28 upvotes
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations Paper • arXiv:2408.12590 • Published Aug 22, 2024 • 35 upvotes
CogVLM2: Visual Language Models for Image and Video Understanding Paper • arXiv:2408.16500 • Published Aug 29, 2024 • 56 upvotes
Building and better understanding vision-language models: insights and future directions Paper • arXiv:2408.12637 • Published Aug 22, 2024 • 124 upvotes
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution Paper • arXiv:2409.12961 • Published Sep 19, 2024 • 24 upvotes
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models Paper • arXiv:2410.02740 • Published Oct 3, 2024 • 52 upvotes
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark Paper • arXiv:2410.03051 • Published Oct 4, 2024 • 4 upvotes
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models Paper • arXiv:2410.03290 • Published Oct 4, 2024 • 6 upvotes
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation Paper • arXiv:2410.01912 • Published Oct 2, 2024 • 13 upvotes
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Paper • arXiv:2409.20566 • Published Sep 30, 2024 • 53 upvotes
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Paper • arXiv:2410.13848 • Published Oct 17, 2024 • 31 upvotes
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders Paper • arXiv:2408.15998 • Published Aug 28, 2024 • 84 upvotes
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos Paper • arXiv:2411.04923 • Published Nov 7, 2024 • 20 upvotes
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval Paper • arXiv:2412.01558 • Published Dec 2024 • 4 upvotes
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? Paper • arXiv:2412.02611 • Published Dec 2024 • 22 upvotes
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper • arXiv:2412.04467 • Published Dec 2024 • 104 upvotes
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper • arXiv:2412.04424 • Published Dec 2024 • 55 upvotes
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions Paper • arXiv:2412.08737 • Published Dec 2024 • 51 upvotes
Multimodal Latent Language Modeling with Next-Token Diffusion Paper • arXiv:2412.08635 • Published Dec 2024 • 41 upvotes
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation Paper • arXiv:2412.09585 • Published Dec 2024 • 10 upvotes
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption Paper • arXiv:2412.09283 • Published Dec 2024 • 19 upvotes
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval Paper • arXiv:2412.14475 • Published Dec 2024 • 51 upvotes
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks Paper • arXiv:2412.15204 • Published Dec 2024 • 30 upvotes
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception Paper • arXiv:2412.14233 • Published Dec 2024 • 6 upvotes