Abstract
Visual language models (VLMs) have progressed rapidly, driven by the success of large language models (LLMs). While model architectures and training infrastructures advance quickly, data curation remains under-explored. When data quantity and quality become a bottleneck, existing work either crawls more raw data from the Internet, with no guarantee of quality, or distills from black-box commercial models (e.g., GPT-4V / Gemini), capping performance at the level of that model. In this work, we introduce a novel approach that includes a self-augment step and a specialist-augment step to iteratively improve data quality and model performance. In the self-augment step, a VLM recaptions its own pretraining data to enhance data quality and is then retrained from scratch on this refined dataset to improve model performance. This process can iterate for several rounds. Once self-augmentation saturates, we employ several specialist VLMs, finetuned from the self-augmented VLM with domain-specific expertise, to further infuse specialist knowledge into the generalist VLM through task-oriented recaptioning and retraining. With the combined self-augmented and specialist-augmented training, we introduce VILA^2 (VILA-augmented-VILA), a VLM family that consistently improves accuracy on a wide range of tasks over prior art and achieves new state-of-the-art results on the MMMU leaderboard among open-source models.
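To make the two-step recipe concrete, here is a minimal sketch of the loop described above. The callables (`train_vlm`, `finetune_specialist`, `recaption`) and the task list are hypothetical placeholders for the actual training pipeline, not the released code.

```python
# Hedged sketch of the VILA^2 recipe from the abstract. All callables below
# (train_vlm, finetune_specialist, model.recaption) are assumed placeholders.

def self_augment(corpus, train_vlm, rounds=3):
    """Each round: recaption the pretraining data with the current VLM,
    then retrain a fresh VLM from scratch on the refined captions."""
    vlm = train_vlm(corpus)                       # bootstrap generalist
    for _ in range(rounds):                       # stop once accuracy saturates
        corpus = [(image, vlm.recaption(image)) for image, _ in corpus]
        vlm = train_vlm(corpus)                   # retrain from scratch
    return vlm, corpus

def specialist_augment(vlm, corpus, train_vlm, finetune_specialist,
                       tasks=("grounding",)):     # task list is illustrative
    """Finetune domain specialists from the self-augmented VLM, recaption the
    data task by task, then retrain the generalist on the enriched corpus."""
    for task in tasks:
        specialist = finetune_specialist(vlm, task)
        corpus = [(image, specialist.recaption(image, task=task))
                  for image, _ in corpus]
    return train_vlm(corpus)
```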
Community
Hi @Seerkfang -- interesting work!
Two questions/comments:
a) Would you consider applying Video-STaR to your approach (effectively using your auxiliary labels to filter generated responses)? I found that keeping only answers that contain the labels (these can be pseudo-labels derived from your aux. models and/or the original image captions) reduces hallucination in the generated answers; see the sketch after these two questions.
b) Are you familiar with visual programming distillation?
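A minimal illustration of the filtering idea in (a), assuming pseudo-labels are plain strings; the names here are hypothetical and this is not the Video-STaR implementation:

```python
def filter_by_labels(samples):
    """Keep only generated answers that mention at least one auxiliary label
    (pseudo-labels from aux models or keywords from the original caption);
    answers that never mention their labels are more likely hallucinated."""
    kept = []
    for answer, labels in samples:
        text = answer.lower()
        if any(label.lower() in text for label in labels):
            kept.append((answer, labels))
    return kept

# Toy example: the second answer ignores its labels and is filtered out.
samples = [
    ("A brown dog chases a red ball in the park.", ["dog", "ball"]),
    ("Two people ride bicycles downtown.", ["cat", "sofa"]),
]
print(filter_by_labels(samples))  # -> keeps only the first pair
```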
Hi @orrzohar , thanks for your interest in our work!
- Yes, we do care about alleviating hallucinations in VLMs. Our grounding specialist has the potential to ensure that VILA^2 responses are accurate and free from hallucinations. I'll read your paper later to explore any synergies. :)
- Our paper focuses on VLMs augmenting VLMs. Here are some points regarding the paper you mentioned:
① Multi-stage data processing can be challenging without a strong verifier (I have experience with LLM verification). For LLMs, we can use formal reasoning or an external interpreter for feedback; for VLMs, a "visible" judge is more reliable, and we are still working on it.
② Complex instruction-following is challenging for current VLMs, especially after so much SFT data targeting short benchmark answers.
③ In our appendix, we mention the challenge of performing bipartite matching between detection results and detailed captions (ambiguities arise), with LLMs tending to introduce more hallucinations.
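For context on ③, here is a rough, illustrative sketch of the kind of matching involved, assuming detections are short text labels and caption noun phrases have already been extracted; the token-overlap score and Hungarian assignment are stand-ins, not the paper's procedure:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections_to_phrases(det_labels, caption_phrases):
    """Bipartite matching between detector outputs and caption noun phrases,
    scored with simple token overlap (illustrative); the ambiguity arises when
    several phrases could plausibly describe the same detection."""
    def overlap(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(1, len(ta | tb))

    cost = -np.array([[overlap(d, p) for p in caption_phrases] for d in det_labels])
    rows, cols = linear_sum_assignment(cost)            # Hungarian assignment
    return [(det_labels[r], caption_phrases[c]) for r, c in zip(rows, cols)
            if -cost[r, c] > 0]                          # drop zero-overlap pairs

print(match_detections_to_phrases(
    ["black dog", "red frisbee"],
    ["a dog with black fur", "a bright red frisbee", "green grass"]))
```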
VILA^2 offers a practical solution for leveraging internet-scale raw visual data and provides a straightforward yet general learning approach for various models. Generalist VLMs can self-bootstrap and learn from specialist VLMs (new skills + better alignment). This method works for models at "any level" (hopefully) and can integrate diverse, task-/knowledge-specialized models, making it practical for real-world applications. We welcome your thoughts and invite you to stay tuned for our upcoming work.
Hi @Seerkfang,
Thank you for your response! I really liked VILA^2, and just thought that you might draw additional inspiration from Video-STaR/VDP to further improve performance even in later generation cycles. I agree that pre-training in general is massively under-explored in LMMs, with VILA/VILA^2 being notable outliers in the OSS community.
Anyways, I found VILA^2 very interesting; I'm personally very excited about self-improvement/bootstrapping in LMMs and will definitely keep an eye out for your future work!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity (2024)
- X-VILA: Cross-Modality Alignment for Large Language Model (2024)
- Tarsier: Recipes for Training and Evaluating Large Video Description Models (2024)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge (2024)
- MAVIS: Mathematical Visual Instruction Tuning (2024)
Hi @Seerkfang, congrats on your work!
Would be great to link the models, datasets and other artifacts to the paper (similar to VILA v1: https://huggingface.co./papers/2312.07533).
Let me know if you need any help :)
Cheers,
Niels
Open-source @ HF
Hi Niels,
Thank you for reaching out. VILA is a fully open-source family of models. We will definitely release the checkpoints and, pending legal approval, the data (recaptioned by larger models) as well. We are currently working on results for larger models and will share more detailed information soon.