Abstract
We present the "Law of Vision Representation" in multimodal large language models (MLLMs): a strong correlation between the combination of cross-modal alignment and correspondence in the vision representation and MLLM performance. We quantify these two factors with the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments spanning thirteen vision representation settings and eight benchmarks, we find that the AC score is linearly correlated with model performance. Leveraging this relationship, we can identify and train only the optimal vision representation, without finetuning the language model for every candidate, resulting in a 99.7% reduction in computational cost.
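A minimal sketch of the linear relationship the abstract describes: fit a line from AC score to benchmark performance across vision-representation settings. The way alignment and correspondence are combined here, and every number, are placeholder assumptions for illustration, not the paper's exact formula or results.

```python
# Sketch: fit a linear "law" relating AC scores to MLLM benchmark performance.
# All settings, scores, and accuracies below are illustrative placeholders.
import numpy as np

def ac_score(alignment: float, correspondence: float) -> float:
    # Simple average of the two factors; the paper's combination may differ.
    return 0.5 * (alignment + correspondence)

# Hypothetical vision-representation settings: (alignment, correspondence).
settings = {
    "clip_vit_l": (0.92, 0.61),
    "dino_v2":    (0.55, 0.78),
    "siglip":     (0.90, 0.64),
}
benchmark_acc = {"clip_vit_l": 61.2, "dino_v2": 54.8, "siglip": 62.0}  # placeholders

x = np.array([ac_score(*settings[k]) for k in settings])
y = np.array([benchmark_acc[k] for k in settings])

# Least-squares line: performance ≈ a * AC + b
a, b = np.polyfit(x, y, deg=1)
print(f"fitted law: performance ≈ {a:.2f} * AC + {b:.2f}")
```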
Community
We study how vision representations relate to MLLM performance, and propose an AC policy that suggests which vision model to use! 😉
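As a rough illustration of how such a policy could work (not the paper's exact procedure), the sketch below ranks hypothetical candidate vision representations by the performance predicted from a fitted AC law, so the language model never needs to be finetuned per candidate.

```python
# Hedged sketch of an "AC policy": rank unseen candidates by predicted performance.
# Coefficients, candidate names, and AC scores are hypothetical placeholders.
def predict_performance(ac: float, a: float, b: float) -> float:
    """Predicted benchmark score under a fitted linear law: a * AC + b."""
    return a * ac + b

a, b = 25.0, 40.0  # e.g., coefficients from a fit like the sketch above

candidates = {"convnext_clip": 0.81, "sam_vit_h": 0.58, "siglip_so400m": 0.88}
ranked = sorted(candidates, key=lambda k: predict_performance(candidates[k], a, b), reverse=True)
print("suggested vision representation:", ranked[0])
```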
Hi @chenfengx congrats on this work!
It would be great to update the pipeline_tag: text-generation to pipeline_tag: image-text-to-text in each of the model repositories, which is more appropriate for VLMs (models like LLaVa, Florence-2, PaliGemma, etc. are also using this tag).
This way people can discover them from https://huggingface.co./models?pipeline_tag=image-text-to-text.
Cheers!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs (2024)
- Are Bigger Encoders Always Better in Vision Large Models? (2024)
- ParGo: Bridging Vision-Language with Partial and Global Views (2024)
- EVLM: An Efficient Vision-Language Model for Visual Understanding (2024)
- A Single Transformer for Scalable Vision-Language Modeling (2024)
Thanks and congrats on this work, @chenfengx!
In your paper, I have a specific question about the equation: could you elaborate on how the embedding vector E_i^vis is computed from the visual features F? And what is the specific value of the visual features F here?
Thank you very much for your time and for sharing your valuable research with the community. I am looking forward to your response.
Thanks for your interest in our work!