Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Abstract
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). Recent studies highlight that these abilities consist of two main parts: recognizing key information from visual inputs and conducting reasoning over it. Thus, a promising approach to enhance MLLMs is to construct relevant training data focusing on the two aspects. However, collecting and annotating complex charts and questions is costly and time-consuming, and ensuring the quality of annotated answers remains a challenge. In this paper, we propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. The code serves as an intermediary that translates visual chart representations into textual representations, enabling LLMs to understand cross-modal information. Specifically, we employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs to enhance both recognition and reasoning abilities. Experiments show that when fine-tuned with our data, models not only perform well on chart-related benchmarks, but also demonstrate improved multimodal reasoning abilities on general mathematical benchmarks like MathVista. The code and dataset are publicly available at https://github.com/hewei2001/ReachQA.
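As a rough illustration of the Code-as-Intermediary Translation idea (a minimal sketch of ours, not code from the paper; the labels, values, and output path are made up), a chart can live entirely as plotting code that an LLM reads and manipulates as text, while rendering the same code yields the image an MLLM is trained on:

```python
# Minimal sketch of "code as intermediary" (illustrative, not from the paper):
# the chart's semantics live entirely in this text, which an LLM can read and
# edit, while executing it produces the visual chart for the MLLM.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical data; in practice these values would be synthesized by an LLM.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [4.2, 5.1, 4.8, 6.3]   # in $M
costs = [3.0, 3.4, 3.9, 4.1]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(quarters, revenue, label="Revenue", color="#4C72B0")
ax.plot(quarters, costs, marker="o", label="Costs", color="#DD8452")  # overlay plot
ax.set_title("Quarterly Revenue vs. Costs")
ax.set_ylabel("USD (millions)")
ax.legend()

fig.savefig("chart.png", dpi=150, bbox_inches="tight")  # the visual side for the MLLM
```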
Community
A cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems (2024)
- EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding (2024)
- HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks (2024)
- Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback (2024)
- MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models (2024)
I am very curious.
What are the differences between this paper and ChartLlama, and between it and Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (https://arxiv.org/abs/2407.07053, EMNLP 2024)?
❤️ Thank you for your question and interest in our paper! The works you mentioned are both valuable contributions to the field of MLLMs. (And congratulations on the EMNLP acceptance! 🥳)
In response to your question, there are key differences between these papers:
✨️Difference with ChartLlama:
- Motivation and Focus: While both ChartLlama and our work focus on building training datasets for chart-based tasks, ChartLlama is a pioneering effort that leaves room for improvement in several areas. In our work, we prioritize enhancing chart diversity and visual complexity through Self-Instruct and Evol-Instruct techniques. Additionally, by using code as an intermediary instead of data tables for synthetic Q&A generation, we achieve a higher-quality instruction dataset (see Sections 2.2 and 5.1).
- Experimental Results: Since ChartLlama was published earlier and did not release its dataset, we include comparisons with its subsequent works (ChartAst and ChartInstruct), which further demonstrate the effectiveness of our approach.
✨️Difference with Multimodal Self-Instruct:
- Motivation and Focus: Our work is motivated by enhancing two core aspects of visual reasoning, particularly in chart-based scenarios. Unlike Multimodal Self-Instruct, which primarily benchmarks existing models on understanding abstract images from daily scenarios, we build on prior chart-related work and introduce a new data synthesis method to generate training data.
- Methodology: Beyond Self-Instruct, we also incorporate Evol-Instruct to generate visually complex chart images, including multi-plot and overlay charts; this helps create reasoning-intensive charts and challenging Q&A pairs (a simplified sketch follows this list). In designing the instruction data, we also target improvements in two key areas: recognition and reasoning.
- Scope of Analysis: To validate our dataset, we trained multiple MLLMs and evaluated them on various open benchmarks. Our results also demonstrate improvements in reasoning-intensive chart tasks and general multimodal reasoning tasks. Additionally, we analyze the effects of different data ratios and the impact of mixing general-purpose data, offering insights into effective training set composition.
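To make the methodology point above more concrete, here is a simplified sketch (illustrative only; `call_llm`, the prompts, and the seed scripts are hypothetical placeholders, not the released ReachQA pipeline) of how Self-Instruct seeds and Evol-Instruct-style complexity evolution can be chained entirely in code space before rendering and Q&A generation:

```python
# Simplified, hypothetical sketch of a Self-Instruct + Evol-Instruct style
# synthesis loop operating purely on chart-plotting code.
import random
import subprocess
import sys

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion endpoint."""
    raise NotImplementedError

SEED_CODES = ["<seed matplotlib script 1>", "<seed matplotlib script 2>"]

EVOLVE_PROMPT = (
    "Rewrite this matplotlib script into a more visually complex chart "
    "(e.g. multiple subplots or overlaid series) while keeping it runnable:\n{code}"
)
QA_PROMPT = (
    "Given only this chart-plotting code, write one recognition question and "
    "one multi-step reasoning question, each with an answer:\n{code}"
)

def synthesize(n_charts: int):
    dataset = []
    for _ in range(n_charts):
        seed = random.choice(SEED_CODES)
        # Evol-Instruct style step: increase visual complexity in code space.
        evolved = call_llm(EVOLVE_PROMPT.format(code=seed))
        # Filter: keep only code that actually executes (and thus renders).
        if subprocess.run([sys.executable, "-c", evolved]).returncode != 0:
            continue
        # Q&A generation also happens purely in text, conditioned on the code.
        qa = call_llm(QA_PROMPT.format(code=evolved))
        dataset.append({"code": evolved, "qa": qa})
    return dataset
```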
📌 By the way, it is also worth noting that Multimodal Self-Instruct was first posted on arXiv roughly two to three months before our paper's submission, so the two can reasonably be regarded as concurrent works exploring similar research directions.
Thank you again for your thoughtful question. I hope this clarifies the distinctions and contributions of our work.
Got it.
Thank you for your reply.
Good luck!