llava-llama-3-8b-hf / README.md
metadata
license: llama3
library_name: xtuner
datasets:
  - Lin-Chen/ShareGPT4V
pipeline_tag: image-text-to-text

Notice: This repository hosts the llava-llama-3-8b-v1_1-hf model, modified to fix compatibility issues with the pure transformers library. The original model configuration and index files were manually adjusted so that the model loads and runs with transformers alone. The model weights are unchanged.
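
As a minimal sketch of how the "weights are unchanged" claim could be checked locally, the snippet below compares SHA-256 checksums of weight shards in two already-downloaded checkpoint directories. The function names, the directory arguments, and the `*.safetensors` file pattern are assumptions made for illustration; they are not part of this repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_weight_files(original_dir, modified_dir, pattern="*.safetensors"):
    """List weight files that are missing on one side or differ in content.

    Only files matching `pattern` are compared, so edits to config or
    index JSON files are ignored. The pattern is an assumption about how
    the shards are named.
    """
    original_dir, modified_dir = Path(original_dir), Path(modified_dir)
    names = {p.name for p in original_dir.glob(pattern)}
    names |= {p.name for p in modified_dir.glob(pattern)}
    changed = []
    for name in sorted(names):
        left, right = original_dir / name, modified_dir / name
        if not (left.is_file() and right.is_file()):
            changed.append(name)
        elif sha256_of(left) != sha256_of(right):
            changed.append(name)
    return changed
```

An empty list from `changed_weight_files(...)` would indicate that every shard is byte-identical between the two directories.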


QuickStart

Chat with lmdeploy

  1. Installation
pip install 'lmdeploy>=0.4.0'
pip install git+https://github.com/haotian-liu/LLaVA.git
  2. Run

Running with pure transformers library

from transformers import (
    LlavaProcessor,
    LlavaForConditionalGeneration,
)
import torch
from PIL import Image
import requests

MODEL_NAME = "Seungyoun/llava-llama-3-8b-hf"

processor = LlavaProcessor.from_pretrained(MODEL_NAME)
# add the <|image|> and <pad> special tokens (the original note gives 128257, presumably a token id)
processor.tokenizer.add_tokens(["<|image|>", "<pad>"], special_tokens=True)

model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda:0")
# resize embeddings
model.resize_token_embeddings(len(processor.tokenizer))


# prepare image and text prompt, using the appropriate prompt template
url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTd4g61TSw890IYKBbPMgXPyWAKdVOpWWUAF0-FGzgX2Q&s"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <|image|>\nWhat is shown in this image? ASSISTANT:"  # TODO: fix the chat template

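# pass text and images by keyword so the call does not depend on positional argument order
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")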

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))
# What is shown in this image? ASSISTANT: The image shows a heartwarming scene of two dogs sitting together on a couch. The dogs are of different breeds, one being a golden retriever and the other being a tabby cat. The dogs are sitting close together, indicating a strong bond between them. The image captures a beautiful moment of companionship between two different species. sit on couch. golden retriever and tabby cat. dogs are sitting together. companionship between two different species.
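
The decoded string above still contains the prompt text, so the answer has to be separated from it. The following is a minimal sketch, assuming the same `USER: ... ASSISTANT:` template as in the prompt above; `build_prompt` and `extract_answer` are hypothetical helper names, not part of transformers.

```python
# System text and image placeholder copied from the quickstart prompt.
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)
IMAGE_TOKEN = "<|image|>"
ASSISTANT_TAG = "ASSISTANT:"


def build_prompt(question: str) -> str:
    """Build a single-turn prompt in the same layout as the quickstart example."""
    return f"{SYSTEM_PROMPT} USER: {IMAGE_TOKEN}\n{question} {ASSISTANT_TAG}"


def extract_answer(decoded: str) -> str:
    """Return the text after the last 'ASSISTANT:' tag.

    If the tag is absent, the whole string is returned stripped.
    """
    _, tag, answer = decoded.rpartition(ASSISTANT_TAG)
    return answer.strip() if tag else decoded.strip()
```

With these helpers, `prompt = build_prompt("What is shown in this image?")` reproduces the prompt used above, and `extract_answer(processor.decode(output[0], skip_special_tokens=True))` keeps only the generated reply.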

Model

llava-llama-3-8b-v1_1-hf is a LLaVA model fine-tuned from meta-llama/Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by XTuner.

Details

| Model | Visual Encoder | Projector | Resolution | Pretraining Strategy | Fine-tuning Strategy | Pretrain Dataset | Fine-tune Dataset |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, Frozen ViT | LLaVA-PT (558K) | LLaVA-Mix (665K) |
| LLaVA-Llama-3-8B | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, LoRA ViT | LLaVA-PT (558K) | LLaVA-Mix (665K) |
| LLaVA-Llama-3-8B-v1.1 | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, LoRA ViT | ShareGPT4V-PT (1246K) | InternVL-SFT (1268K) |

Results

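| Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 |
| LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 |
| LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 |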
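
To make the comparison easier to read, here is a small sketch that computes per-benchmark differences between two rows of the table. The numbers are copied from the table; the names `score_deltas` and `parse_mme` are hypothetical, and treating the MME cell as two slash-separated sub-scores is an assumption, since the table does not label them.

```python
# Single-number benchmarks, in table order (MME is handled separately).
BENCHMARKS = [
    "MMBench Test (EN)", "MMBench Test (CN)", "CCBench Dev", "MMMU Val",
    "SEED-IMG", "AI2D Test", "ScienceQA Test", "HallusionBench aAcc",
    "POPE", "GQA", "TextVQA", "MMStar",
]

SCORES = {
    "LLaVA-v1.5-7B": [66.5, 59.0, 27.5, 35.3, 60.5, 54.8, 70.4, 44.9, 85.9, 62.0, 58.2, 30.3],
    "LLaVA-Llama-3-8B": [68.9, 61.6, 30.4, 36.8, 69.8, 60.9, 73.3, 47.3, 87.2, 63.5, 58.0, 38.2],
    "LLaVA-Llama-3-8B-v1.1": [72.3, 66.4, 31.6, 36.8, 70.1, 70.0, 72.9, 47.7, 86.4, 62.6, 59.0, 45.1],
}

# MME cells as they appear in the table.
MME = {
    "LLaVA-v1.5-7B": "1511/348",
    "LLaVA-Llama-3-8B": "1506/295",
    "LLaVA-Llama-3-8B-v1.1": "1469/349",
}


def score_deltas(model: str, baseline: str) -> dict:
    """Return model-minus-baseline differences per benchmark, rounded to 0.1."""
    pairs = zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    return {name: round(ours - theirs, 1) for name, ours, theirs in pairs}


def parse_mme(cell: str) -> tuple:
    """Split an MME cell such as '1511/348' into two integers."""
    first, second = cell.split("/")
    return int(first), int(second)


if __name__ == "__main__":
    for name, delta in score_deltas("LLaVA-Llama-3-8B-v1.1", "LLaVA-v1.5-7B").items():
        print(f"{name}: {delta:+.1f}")
```

Note that a positive difference is not an improvement on every axis by itself; for example, the first MME number is lower for LLaVA-Llama-3-8B-v1.1 than for LLaVA-v1.5-7B.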

Citation

@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished = {\url{https://github.com/InternLM/xtuner}},
    year={2023}
}