Wings: Learning Multimodal LLMs without Text-only Forgetting
Abstract
Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets text-only instructions, which do not include images and could be addressed by the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention on multimodal instructions reveals that text-only forgetting is related to attention shifting from pre-image to post-image text. Based on this, we construct extra modules that act as boosted learners to compensate for the attention shift. The complementary visual and textual learners, like "wings" on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for the learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.
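To make the abstract's architecture concrete, below is a minimal, hypothetical sketch of a low-rank residual attention learner and the "wings" routing idea, written in PyTorch. All module and parameter names (LoRRALearner, WingsBlock, rank, etc.) are assumptions for illustration; the router here is a simple linear gate over hidden states standing in for the paper's attention-based routing, and this is not the authors' released implementation.

```python
# Hypothetical sketch of LoRRA-style learners attached as "wings" to a frozen
# attention block, with a soft router blending the visual and textual outputs.
import torch
import torch.nn as nn


class LoRRALearner(nn.Module):
    """Low-rank residual attention: queries from hidden states,
    keys/values from one modality's features (visual or textual)."""

    def __init__(self, hidden_dim: int, rank: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, rank, bias=False)
        self.k_proj = nn.Linear(hidden_dim, rank, bias=False)
        self.v_proj = nn.Linear(hidden_dim, rank, bias=False)
        self.out_proj = nn.Linear(rank, hidden_dim, bias=False)

    def forward(self, hidden: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(hidden)                      # (B, T, r)
        k = self.k_proj(feats)                       # (B, S, r)
        v = self.v_proj(feats)                       # (B, S, r)
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)         # (B, T, S)
        return self.out_proj(attn @ v)               # residual added by caller


class WingsBlock(nn.Module):
    """Wraps a main attention block with visual/textual learners ("wings")
    and a router that blends their outputs."""

    def __init__(self, main_attention: nn.Module, hidden_dim: int, rank: int = 16):
        super().__init__()
        self.main_attention = main_attention
        self.visual_learner = LoRRALearner(hidden_dim, rank)
        self.textual_learner = LoRRALearner(hidden_dim, rank)
        self.router = nn.Linear(hidden_dim, 2)       # simplified routing gate

    def forward(self, hidden, visual_feats, textual_feats):
        main_out = self.main_attention(hidden)
        vis_out = self.visual_learner(hidden, visual_feats)
        txt_out = self.textual_learner(hidden, textual_feats)
        weights = torch.softmax(self.router(hidden), dim=-1)    # (B, T, 2)
        routed = weights[..., 0:1] * vis_out + weights[..., 1:2] * txt_out
        return main_out + routed
```

The low-rank projections keep the added parameter count small relative to the frozen attention block, which is the efficiency property the abstract attributes to LoRRA.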