Multimodal - a minlik Collection

Multimodal

updated Sep 5

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Paper • 2306.17107 • Published Jun 29, 2023 • 11

On the Hidden Mystery of OCR in Large Multimodal Models
Paper • 2305.07895 • Published May 13, 2023

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published Aug 24, 2023 • 6

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published Jan 29 • 48

DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
Paper • 2311.11810 • Published Nov 20, 2023 • 1

OCR-free Document Understanding Transformer
Paper • 2111.15664 • Published Nov 30, 2021 • 2

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Paper • 2210.03347 • Published Oct 7, 2022 • 3

Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction
Paper • 2310.11016 • Published Oct 17, 2023

Nougat: Neural Optical Understanding for Academic Documents
Paper • 2308.13418 • Published Aug 25, 2023 • 35

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
Paper • 2403.00522 • Published Mar 1 • 44

MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published Mar 12 • 75

ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published Feb 7 • 38

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Paper • 2404.05225 • Published Apr 8

LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding
Paper • 2403.14252 • Published Mar 21

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Paper • 2405.15738 • Published May 24 • 43

CogVLM: Visual Expert for Pretrained Language Models
Paper • 2311.03079 • Published Nov 6, 2023 • 23

Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Paper • 2405.20204 • Published May 30 • 32

OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published Jun 13 • 36

Unveiling Encoder-Free Vision-Language Models
Paper • 2406.11832 • Published Jun 17 • 49

ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper • 2407.04172 • Published Jul 4 • 22

LLaVA-OneVision: Easy Visual Task Transfer
Paper • 2408.03326 • Published Aug 6 • 59

Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published Aug 29 • 92

CogVLM2: Visual Language Models for Image and Video Understanding
Paper • 2408.16500 • Published Aug 29 • 56