Abstract
We present the "Law of Vision Representation" in multimodal large language models (MLLMs): a strong correlation between the combination of cross-modal alignment and correspondence in the vision representation and MLLM performance. We quantify these two factors with the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments spanning thirteen vision representation settings and eight benchmarks, we find that the AC score is linearly correlated with model performance. Leveraging this relationship, we can identify and train only the optimal vision representation, without finetuning the language model for every candidate, resulting in a 99.7% reduction in computational cost.
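A minimal sketch of the linear relationship the abstract describes: fit a line from AC score to benchmark performance across vision-representation settings. The way alignment and correspondence are combined here, and every number, are placeholder assumptions for illustration, not the paper's exact formula or results.

```python
# Sketch: fit a linear "law" relating AC scores to MLLM benchmark performance.
# All settings, scores, and accuracies below are illustrative placeholders.
import numpy as np

def ac_score(alignment: float, correspondence: float) -> float:
    # Simple average of the two factors; the paper's combination may differ.
    return 0.5 * (alignment + correspondence)

# Hypothetical vision-representation settings: (alignment, correspondence).
settings = {
    "clip_vit_l": (0.92, 0.61),
    "dino_v2":    (0.55, 0.78),
    "siglip":     (0.90, 0.64),
}
benchmark_acc = {"clip_vit_l": 61.2, "dino_v2": 54.8, "siglip": 62.0}  # placeholders

x = np.array([ac_score(*settings[k]) for k in settings])
y = np.array([benchmark_acc[k] for k in settings])

# Least-squares line: performance ≈ a * AC + b
a, b = np.polyfit(x, y, deg=1)
print(f"fitted law: performance ≈ {a:.2f} * AC + {b:.2f}")
```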
Community
We study how vision representations relate to MLLM performance, and propose an AC policy that suggests which vision model to use! 😉
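As a rough illustration of how such a policy could work (not the paper's exact procedure), the sketch below ranks hypothetical candidate vision representations by the performance predicted from a fitted AC law, so the language model never needs to be finetuned per candidate.

```python
# Hedged sketch of an "AC policy": rank unseen candidates by predicted performance.
# Coefficients, candidate names, and AC scores are hypothetical placeholders.
def predict_performance(ac: float, a: float, b: float) -> float:
    """Predicted benchmark score under a fitted linear law: a * AC + b."""
    return a * ac + b

a, b = 25.0, 40.0  # e.g., coefficients from a fit like the sketch above

candidates = {"convnext_clip": 0.81, "sam_vit_h": 0.58, "siglip_so400m": 0.88}
ranked = sorted(candidates, key=lambda k: predict_performance(candidates[k], a, b), reverse=True)
print("suggested vision representation:", ranked[0])
```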
Hi @chenfengx congrats on this work!
It would be great to update the pipeline_tag: text-generation to pipeline_tag: image-text-to-text in each of the model repositories, which is more appropriate for VLMs (models like LLaVa, Florence-2, PaliGemma, etc. are also using this tag).
This way people can discover them from https://huggingface.co./models?pipeline_tag=image-text-to-text.
Cheers!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs (2024)
- Are Bigger Encoders Always Better in Vision Large Models? (2024)
- ParGo: Bridging Vision-Language with Partial and Global Views (2024)
- EVLM: An Efficient Vision-Language Model for Visual Understanding (2024)
- A Single Transformer for Scalable Vision-Language Modeling (2024)
Thanks and congrats on this work, @chenfengx!
In your paper, I have a specific question about the equation: could you elaborate on how the embedding vector E_i^vis is computed from the visual features F? And what is the specific value of the visual features F here?
Thank you very much for your time and for sharing your valuable research with the community. I am looking forward to your response.
Thanks for your interest in our work!