LLaVA-RLHF Model Card

Model details

Model type: LLaVA-RLHF represents a novel aligned end-to-end trained large multimodal model that combines a CLIP vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive visual reasoning and perception capabilities mimicking spirits of the multimodal GPT-4. Via Factually Augmented RLHF, LLaVA-RLHF is presented to be more helpful and less hallucinated than LLaVA or other open-sourced LMMs.

Usage: NOTE: The RLHFed model is trained with LoRA and the bfloat16 data type. Users have to apply the PEFT-LoRA on the LLaVA-SFT+ model.

dtype = torch.bfloat16

model_path = "LLaVA-RLHF-13b-v1.5-336/sft_model"
lora_path = "LLaVA-RLHF-13b-v1.5-336/rlhf_lora_adapter_model"

model = LlavaLlamaForCausalLM.from_pretrained(
    model_path,
    device_map={"": "cuda:0"},
    torch_dtype=dtype,
)

model = PeftModel.from_pretrained(
    model,
    lora_path,
)

Model date: LLaVA-RLHF was trained in Sept 2023.

Paper or resources for more information: https://llava-rlhf.github.io/

License: Apache License 2.0

Where to send questions or comments about the model: https://github.com/llava-rlhf/LLaVA-RLHF/issues

Intended use

Primary intended uses: The primary use of LLaVA-RLHF is research on large multimodal chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

595K filtered image-text pairs from CC3M.

150K GPT-generated multimodal instruction-following chat data.

83K VQA v2 instruction-following VQA data.

16K A-OKVQA instruction-following CoT-VQA data.

23K FLICKR instruction-following spotting captioning data.

10K LLaVA-based human preference data