arxiv:2407.17453

VILA^2: VILA Augmented VILA

Published on Jul 24
· Submitted by Seerkfang on Jul 25
#2 Paper of the day

Abstract

Visual language models (VLMs) have progressed rapidly, driven by the success of large language models (LLMs). While model architectures and training infrastructures advance quickly, data curation remains under-explored. When data quantity and quality become a bottleneck, existing work either crawls more raw data from the Internet, with no guarantee of quality, or distills from black-box commercial models (e.g., GPT-4V / Gemini), capping performance at the level of the teacher model. In this work, we introduce a novel approach that includes a self-augment step and a specialist-augment step to iteratively improve data quality and model performance. In the self-augment step, a VLM recaptions its own pretraining data to enhance data quality, then retrains from scratch on this refined dataset to improve model performance. This process can iterate for several rounds. Once self-augmentation saturates, we employ several specialist VLMs, finetuned from the self-augmented VLM with domain-specific expertise, to further infuse specialist knowledge into the generalist VLM through task-oriented recaptioning and retraining. With the combined self-augmented and specialist-augmented training, we introduce VILA^2 (VILA-augmented-VILA), a VLM family that consistently improves accuracy on a wide range of tasks over prior art and achieves new state-of-the-art results on the MMMU leaderboard among open-source models.
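
The two-stage augmentation loop described above can be summarized, at a very high level, in the following hedged Python sketch. The callables (recaption, pretrain, finetune_specialist, evaluate), their signatures, and the saturation check are placeholders chosen for illustration, not the authors' actual training code.

```python
# Hedged sketch of the VILA^2 training loop described in the abstract.
# All callables are placeholders to be supplied by the reader; the names
# and the saturation check are assumptions, not the authors' API.
from typing import Callable, List

def self_augment(data, recaption: Callable, pretrain: Callable,
                 evaluate: Callable, max_rounds: int = 3, eps: float = 1e-3):
    """Recaption the pretraining data with the current VLM and retrain from
    scratch, repeating until the accuracy gain per round falls below eps."""
    vlm = pretrain(data)
    prev_score = evaluate(vlm)
    for _ in range(max_rounds):
        data = recaption(vlm, data)          # self-recaptioning pass
        vlm = pretrain(data)                 # retrain from scratch on refined data
        score = evaluate(vlm)
        if score - prev_score < eps:         # self-augmentation has saturated
            break
        prev_score = score
    return vlm, data

def specialist_augment(vlm, data, tasks: List[str], recaption: Callable,
                       pretrain: Callable, finetune_specialist: Callable):
    """Finetune domain specialists from the self-augmented VLM, use them for
    task-oriented recaptioning, then retrain the generalist on the result."""
    for task in tasks:
        specialist = finetune_specialist(vlm, task)
        data = recaption(specialist, data)   # infuse specialist knowledge into captions
    return pretrain(data)
```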

Community

Hi @Seerkfang -- interesting work!
Two questions/comments:
a) Would you consider applying Video-STaR to your approach (effectively using your auxiliary labels to filter generated responses)? I found that keeping only answers that contain the labels (these can be pseudo-labels derived from your aux. models and/or the original image captions) reduces hallucination in the generated answers; a rough sketch of this kind of filtering follows after these questions.
b) Are you familiar with visual programming distillation?
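
For readers unfamiliar with the filtering idea in (a), here is a minimal, self-contained Python sketch of label-containment filtering. It assumes each sample carries a generated answer plus a list of auxiliary (pseudo-)labels; the field names and the simple substring check are illustrative assumptions, not taken from Video-STaR or VILA^2.

```python
# Hypothetical sketch of label-based answer filtering: keep a generated
# answer only if it mentions all auxiliary (pseudo-)labels for the image.
# Field names ("answer", "labels") are illustrative, not from either paper.

def filter_by_labels(samples):
    """Each sample is a dict with a generated 'answer' string and a list of
    auxiliary 'labels' (e.g., detector outputs or original captions)."""
    kept = []
    for s in samples:
        answer = s["answer"].lower()
        if all(label.lower() in answer for label in s["labels"]):
            kept.append(s)
    return kept

# Example: the first sample survives, the second is filtered out.
samples = [
    {"answer": "A dog chasing a red ball in the park.", "labels": ["dog", "ball"]},
    {"answer": "Two cats sleeping on a sofa.", "labels": ["dog"]},
]
print(filter_by_labels(samples))
```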

Paper author Paper submitter
edited Jul 25

Hi @orrzohar , thanks for your interest in our work!

  1. Yes, we do care about alleviating hallucinations in VLMs. Our grounding specialist has the potential to ensure that VILA^2 responses are accurate and free from hallucinations. I'll read your paper later to explore any synergies. :)
  2. Our paper focuses on VLMs augmenting VLMs. Here are some points regarding the paper you mentioned:
    ① Multi-stage data processing can be challenging without a strong verifier (I have experience with LLM verification). For LLMs, we can use formal reasoning or an external interpreter for feedback; for VLMs, a "visible" judge would be more reliable, and we are still working on one.
    ② Complex instruction-following is challenging for current VLMs, especially after training on so much SFT data that targets short benchmark answers.
    ③ In our appendix, we mentioned the challenge of performing bipartite matching between detection results and detailed captions (the pairing is ambiguous), with LLMs tending to introduce more hallucinations; a toy illustration of this matching setup is sketched below.
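
As a toy illustration of the matching problem in point ③, the sketch below pairs detector outputs with caption phrases using the Hungarian algorithm (scipy.optimize.linear_sum_assignment) over a naive token-overlap cost. The cost function and the example strings are assumptions chosen for illustration; the paper's appendix discusses why this matching is ambiguous in practice.

```python
# Toy bipartite matching between detector outputs and caption phrases using
# the Hungarian algorithm. The token-overlap cost is a stand-in; real captions
# make this pairing ambiguous, which is the difficulty noted above.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections_to_phrases(detections, phrases):
    """Return (detection_idx, phrase_idx) pairs minimizing a simple
    token-overlap cost. Both inputs are lists of strings."""
    cost = np.zeros((len(detections), len(phrases)))
    for i, det in enumerate(detections):
        for j, phrase in enumerate(phrases):
            overlap = len(set(det.lower().split()) & set(phrase.lower().split()))
            cost[i, j] = -overlap  # more shared tokens -> lower cost
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

# Example usage:
dets = ["red ball", "brown dog"]
caps = ["a brown dog running", "a red ball on the grass"]
print(match_detections_to_phrases(dets, caps))  # [(0, 1), (1, 0)]
```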

VILA^2 offers a practical solution for leveraging internet-scale raw visual data and provides a straightforward yet general learning approach for various models. Generalist VLMs can self-bootstrap and learn from specialist VLMs (new skills + better alignment). This method works for models at "any level" (hopefully) and can integrate diverse, task-/knowledge-specialized models, making it practical for real-world applications. We welcome your thoughts and invite you to stay tuned for our upcoming work.

Hi @Seerkfang ,
Thank you for your response! I really liked VILA^2, and just thought you might draw additional inspiration from Video-STaR/VDP to further improve performance even in later generation cycles. I agree that pre-training in general is massively under-explored in LMMs, with VILA/VILA^2 being notable exceptions in the OSS community.
Anyway, I found VILA^2 very interesting; I am personally very excited about self-improvement/bootstrapping in LMMs and will definitely keep an eye out for your future work!

Hi @Seerkfang congrats on your work!

It would be great to link the models, datasets, and other artifacts to the paper (similar to VILA v1: https://huggingface.co./papers/2312.07533).

Let me know if you need any help :)

Cheers,
Niels
Open-source @ HF

Paper author
edited Jul 29

Hi Niels,

Thank you for your interest. VILA is a fully open-source family of models. We will definitely release the checkpoints and, pending legal approval, the data (recaptioned by larger models) as well. We are currently working on results for larger models and will provide more detailed information soon.
