LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding Paper • 2306.17107 • Published Jun 29, 2023 • 11
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities Paper • 2308.12966 • Published Aug 24, 2023 • 6
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models Paper • 2401.15947 • Published Jan 29, 2024 • 48
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding Paper • 2311.11810 • Published Nov 20, 2023 • 1
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding Paper • 2210.03347 • Published Oct 7, 2022 • 3
Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction Paper • 2310.11016 • Published Oct 17, 2023
Nougat: Neural Optical Understanding for Academic Documents Paper • 2308.13418 • Published Aug 25, 2023 • 35
MoAI: Mixture of All Intelligence for Large Language and Vision Models Paper • 2403.07508 • Published Mar 12, 2024 • 75
ScreenAI: A Vision-Language Model for UI and Infographics Understanding Paper • 2402.04615 • Published Feb 7, 2024 • 38
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding Paper • 2404.05225 • Published Apr 8, 2024
LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding Paper • 2403.14252 • Published Mar 21, 2024
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24, 2024 • 43
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild Paper • 2407.04172 • Published Jul 4, 2024 • 22
CogVLM2: Visual Language Models for Image and Video Understanding Paper • 2408.16500 • Published Aug 29, 2024 • 56