@akhaliq on Hugging Face: "InternLM-XComposer2 Mastering Free-form Text-Image Composition and…"

InternLM-XComposer2

Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

paper page: InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (2401.16420)

Experimental results show that InternLM-XComposer2, built on InternLM2-7B, produces high-quality long-text multi-modal content and delivers strong vision-language understanding across various benchmarks. It significantly outperforms existing multimodal models and matches or even surpasses GPT-4V and Gemini Pro in certain assessments.
