Model Name: llava-v1.5-13b-dpo
[Arxiv paper] [GitHub] [Data] [Model]
Developers: Shengzhi Li (TIFIN), Rongyu Lin (KAUST), Shichao Pei (University of Massachusetts Boston)
Affiliations: TIFIN, KAUST, University of Massachusetts Boston
Contact Information: [email protected], [email protected], [email protected]
Overview
The llava-v1.5-13b-dpo model is designed to enhance the instruction-following capabilities of multi-modal large language models (MLLMs), particularly in scenarios where visual instruction tuning degrades language proficiency. The model is trained with Direct Preference Optimization (DPO) on a curated 6k-entry VQA preference dataset, and it outperforms the LLaVA-1.5-13b baseline on five of the six benchmarks reported in the Evaluation section.
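For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair. It illustrates the general technique only: the function name, the `beta` value, and the example log-probabilities are assumptions for illustration, and the exact training recipe used for this model is not described in this card.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) answer pair.

    Each argument is the summed log-probability of a full answer under
    either the trainable policy or the frozen reference model.
    beta=0.1 is an illustrative value, not taken from this model card.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy prefers the chosen answer
# more strongly than the reference does, so the loss drops below log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log(2). Training pushes the margin up, which lowers the loss.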
Intended Use
- Primary Applications: This model is intended for tasks requiring the integration of text and image modalities, including but not limited to visual question answering (VQA), image captioning, and multi-modal instruction following.
- Target Audience: Researchers and practitioners in the fields of natural language processing, computer vision, and multi-modal AI.
Training Data
The model was trained on a lightweight VQA preference dataset of 6k entries, in which each answer was annotated at a granular level on 5 quality metrics. The dataset was designed to address the gap in diversity and complexity typically observed in VQA datasets.
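As a rough illustration of how per-metric annotations can become preference pairs for DPO, here is a minimal sketch. Everything in it is hypothetical: the card does not list the 5 metric names, the record layout, the aggregation rule, or any score threshold, so the placeholder metric names, the unweighted mean, and `min_gap` are assumptions.

```python
from itertools import combinations

# Placeholder names: the card mentions 5 metrics but does not name them.
METRICS = ["metric_a", "metric_b", "metric_c", "metric_d", "metric_e"]

def overall_score(scores):
    """Collapse per-metric scores into one number (unweighted mean, an assumption)."""
    return sum(scores[m] for m in METRICS) / len(METRICS)

def build_pairs(entry, min_gap=1.0):
    """Turn one annotated VQA entry into (chosen, rejected) records."""
    pairs = []
    for a, b in combinations(entry["answers"], 2):
        gap = overall_score(a["scores"]) - overall_score(b["scores"])
        if abs(gap) < min_gap:
            continue  # scores too close; skip the ambiguous pair
        chosen, rejected = (a, b) if gap > 0 else (b, a)
        pairs.append({
            "image": entry["image"],
            "question": entry["question"],
            "chosen": chosen["text"],
            "rejected": rejected["text"],
        })
    return pairs

# Hypothetical entry with three candidate answers and made-up scores.
example = {
    "image": "bus.jpg",
    "question": "What is in the picture?",
    "answers": [
        {"text": "A red bus parked beside a bakery.",
         "scores": dict(zip(METRICS, [4, 4, 4, 4, 4]))},
        {"text": "A red bus on a street.",
         "scores": dict(zip(METRICS, [4, 4, 3, 4, 3]))},
        {"text": "A blue train.",
         "scores": dict(zip(METRICS, [1, 1, 1, 1, 1]))},
    ],
}
pairs = build_pairs(example)
```

In this example the first two answers score too closely to form a pair, so only the two pairs that reject "A blue train." are kept.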
Evaluation
The model improves over the Vicuna-1.5-13b and LLaVA-1.5-13b baselines on most of the reported benchmarks:
- MT-Bench: Achieved a score of 6.73, surpassing Vicuna-1.5-13b (6.57) and LLaVA-1.5-13b (5.99).
- Visual Instruction Performance: Gained 4.9 points on MM-Vet (36.3 to 41.2) and 6.0 points on LLaVA-Bench (73.1 to 79.1) over LLaVA-1.5-13b.
Model Name | MM-Vet | LLaVA-Bench | POPE | MM-Bench | MT-Bench | AlpacaEval |
---|---|---|---|---|---|---|
Vicuna-1.5-13b [16] | - | - | - | - | 6.57 | 81.4 |
LLaVA-1.5-13b [10] | 36.3 | 73.1 | 0.859 | 67.4 | 5.99 | 79.3 |
LLaVA-RLHF-13b [23] | 37.2 | 76.8 | 0.869 | 60.1 | 6.18 | 81.0 |
Standard SFT | 36.5 | 63.7 | 0.850 | 65.4 | 5.01 | 50.2 |
SteerLM | 35.2 | 67.0 | 0.878 | 65.1 | 5.70 | 68.8 |
Rejection-sampling | 38.0 | 70.6 | 0.883 | 67.6 | 6.22 | 74.9 |
llava-v1.5-13b-dpo | 41.2 | 79.1 | 0.870 | 66.8 | 6.73 | 86.4 |
*The last four rows are the alignment methods we applied: standard SFT, SteerLM, rejection sampling, and DPO. We found DPO to be the most performant.
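The gains over LLaVA-1.5-13b can be recomputed directly from the table. The short sketch below copies the two relevant rows and reports both the absolute difference and the relative change. The helper name `deltas` is illustrative only.

```python
# Scores copied from the table: LLaVA-1.5-13b baseline vs. llava-v1.5-13b-dpo.
baseline = {"MM-Vet": 36.3, "LLaVA-Bench": 73.1, "POPE": 0.859,
            "MM-Bench": 67.4, "MT-Bench": 5.99, "AlpacaEval": 79.3}
dpo = {"MM-Vet": 41.2, "LLaVA-Bench": 79.1, "POPE": 0.870,
       "MM-Bench": 66.8, "MT-Bench": 6.73, "AlpacaEval": 86.4}

def deltas(base, new):
    """Return {benchmark: (absolute gain, relative gain in percent)}."""
    out = {}
    for name in base:
        diff = new[name] - base[name]
        out[name] = (round(diff, 3), round(100 * diff / base[name], 1))
    return out

for name, (abs_gain, rel_gain) in deltas(baseline, dpo).items():
    print(f"{name:12s} {abs_gain:+.3f} ({rel_gain:+.1f}%)")
```

The MM-Vet and LLaVA-Bench gains come out as 4.9 and 6.0 absolute points, and MM-Bench is the one benchmark where the DPO model scores slightly below the baseline.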
Ethical Considerations
This model was developed with a focus on mitigating modality conflict and catastrophic forgetting in MLLMs. Users are encouraged to consider the potential biases and limitations inherent in the training data and model outputs, especially when deploying the model in diverse and sensitive contexts.
Limitations
- The model's training dataset, while addressing key gaps in VQA datasets, is relatively small at 6k entries. This may limit the model's generalizability across broader or more diverse multi-modal tasks.
- The reported gains, particularly the recovery of language instruction-following ability after visual instruction tuning, were measured only on the benchmarks and datasets evaluated here. The model's efficacy may vary in different or more challenging contexts.
Acknowledgments
This work was made possible through the contributions of Shengzhi Li, Rongyu Lin, and Shichao Pei, and supported by their respective institutions.
Citation
Please cite this work as:
@misc{li2024multimodal,
title={Multi-modal preference alignment remedies regression of visual instruction tuning on language model},
author={Shengzhi Li and Rongyu Lin and Shichao Pei},
year={2024},
eprint={2402.10884},
archivePrefix={arXiv},
primaryClass={cs.CL}
}