matlok's Collections
Papers - Multimodal
- TinyLLaVA: A Framework of Small-scale Large Multimodal Models • arXiv:2402.14289 • 19 upvotes
- ImageBind: One Embedding Space To Bind Them All • arXiv:2305.05665 • 3 upvotes
- DocLLM: A layout-aware generative language model for multimodal document understanding • arXiv:2401.00908 • 181 upvotes
- Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts • arXiv:2206.02770 • 3 upvotes
- arXiv:2104.03964 • 2 upvotes
- MoAI: Mixture of All Intelligence for Large Language and Vision Models • arXiv:2403.07508 • 75 upvotes
- Veagle: Advancements in Multimodal Representation Learning • arXiv:2403.08773 • 7 upvotes
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality • arXiv:2304.14178 • 2 upvotes
- Gemini: A Family of Highly Capable Multimodal Models • arXiv:2312.11805 • 45 upvotes
- Flamingo: a Visual Language Model for Few-Shot Learning • arXiv:2204.14198 • 14 upvotes
- Training Compute-Optimal Large Language Models • arXiv:2203.15556 • 10 upvotes
- arXiv:2309.16609 • 34 upvotes
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling • arXiv:2402.12226 • 40 upvotes
- Unifying Vision, Text, and Layout for Universal Document Processing • arXiv:2212.02623 • 10 upvotes
- Uni-SMART: Universal Science Multimodal Analysis and Research Transformer • arXiv:2403.10301 • 51 upvotes
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent • arXiv:2403.10517 • 31 upvotes
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images • arXiv:2403.11703 • 16 upvotes
- TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation • arXiv:2403.12906 • 5 upvotes
- HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models • arXiv:2403.13447 • 18 upvotes
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? • arXiv:2403.14624 • 51 upvotes
- A Multimodal Approach to Device-Directed Speech Detection with Large Language Models • arXiv:2403.14438 • 2 upvotes
- InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding • arXiv:2403.15377 • 22 upvotes
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models • arXiv:2403.18814 • 44 upvotes
- FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction • arXiv:2305.02549 • 6 upvotes
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities • arXiv:2308.12966 • 6 upvotes
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models • arXiv:2404.03118 • 23 upvotes
- No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance • arXiv:2404.04125 • 27 upvotes
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs • arXiv:2404.05719 • 80 upvotes
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models • arXiv:2404.07973 • 30 upvotes
- BLINK: Multimodal Large Language Models Can See but Not Perceive • arXiv:2404.12390 • 24 upvotes
- TokenPacker: Efficient Visual Projector for Multimodal LLM • arXiv:2407.02392 • 21 upvotes
- Data curation via joint example selection further accelerates multimodal learning • arXiv:2406.17711 • 3 upvotes
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective • arXiv:2407.08583 • 10 upvotes
- PaliGemma: A versatile 3B VLM for transfer • arXiv:2407.07726 • 66 upvotes
- Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model • arXiv:2407.07053 • 41 upvotes
- VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models • arXiv:2407.11691 • 13 upvotes
- Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling • arXiv:2408.03695 • 12 upvotes
- mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models • arXiv:2408.04840 • 31 upvotes
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models • arXiv:2408.08872 • 97 upvotes
- Law of Vision Representation in MLLMs • arXiv:2408.16357 • 92 upvotes
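
Since this list mirrors a Hugging Face Hub collection, it can also be fetched programmatically. Below is a minimal sketch using the huggingface_hub client's get_collection; the collection slug is a placeholder assumption (the page does not show it), so substitute the real slug from the collection's URL.

```python
# Minimal sketch: list the paper IDs in a Hub collection via huggingface_hub.
# Assumption: the slug below is a hypothetical placeholder; copy the real one
# from the collection page's URL.
from huggingface_hub import get_collection

collection = get_collection("matlok/papers-multimodal-xxxxxxxx")  # hypothetical slug
for item in collection.items:
    if item.item_type == "paper":
        # For paper items, item_id is the arXiv identifier, e.g. "2402.14289".
        print(f"https://huggingface.co/papers/{item.item_id}")
```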