---
license: llama3
library_name: xtuner
datasets:
- Lin-Chen/ShareGPT4V
pipeline_tag: image-text-to-text
---
**Notice:** This repository hosts the `xtuner/llava-llama-3-8b-v1_1-hf` model, modified to resolve compatibility issues with the plain `transformers` library. The original model configuration and index files were manually adjusted so the model loads and runs with a standard `transformers` setup; the model weights themselves are unchanged.
## QuickStart
### Chat with lmdeploy
- Installation

```shell
pip install 'lmdeploy>=0.4.0'
pip install git+https://github.com/haotian-liu/LLaVA.git
```
- Run
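A minimal run sketch using lmdeploy's VLM `pipeline` API; the `llama3` chat-template name and the example image URL are assumptions carried over from the upstream XTuner card and not verified against this repository:

```python
from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image

# Build a vision-language pipeline; the 'llama3' chat template name is an assumption.
pipe = pipeline('xtuner/llava-llama-3-8b-v1_1-hf',
                chat_template_config=ChatTemplateConfig(model_name='llama3'))

# Any reachable image URL or local path works here.
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```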
### Running with pure `transformers` library
```python
from transformers import (
    LlavaProcessor,
    LlavaForConditionalGeneration,
)
import torch
from PIL import Image
import requests

MODEL_NAME = "Seungyoun/llava-llama-3-8b-hf"

processor = LlavaProcessor.from_pretrained(MODEL_NAME)
# add <|image|> and <pad> as special tokens on top of the base Llama-3 vocabulary
processor.tokenizer.add_tokens(["<|image|>", "<pad>"], special_tokens=True)

# load in half precision so the 8B model fits on a single GPU
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16
).to("cuda:0")
# resize the embedding matrix to cover the newly added tokens
model.resize_token_embeddings(len(processor.tokenizer))

# prepare image and text prompt, using the appropriate prompt template
url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTd4g61TSw890IYKBbPMgXPyWAKdVOpWWUAF0-FGzgX2Q&s"
image = Image.open(requests.get(url, stream=True).raw)
prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions. "
    "USER: <|image|>\nWhat is shown in this image? ASSISTANT:"
)  # FIX: chat template

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0", torch.float16)

# autoregressively complete the prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
# ... What is shown in this image? ASSISTANT: The image shows a heartwarming scene of two dogs
# sitting together on a couch. The dogs are of different breeds, one being a golden retriever
# and the other being a tabby cat. The dogs are sitting close together, indicating a strong
# bond between them. The image captures a beautiful moment of companionship between two
# different species. sit on couch. golden retriever and tabby cat. dogs are sitting together.
# companionship between two different species.
```
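Optionally, the patched processor and resized model can be saved so the token additions and embedding resize above do not have to be repeated on every load; this is a sketch, and the output directory name is just an example:

```python
# Persist the patched processor/model from the example above; the directory name is arbitrary.
save_dir = "./llava-llama-3-8b-hf-patched"
processor.save_pretrained(save_dir)
model.save_pretrained(save_dir)
```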
## Model
llava-llama-3-8b-v1_1-hf is a LLaVA model fine-tuned from meta-llama/Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by XTuner.
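A quick way to confirm this composition from the released config files; a sketch assuming the standard `transformers` `LlavaConfig` layout (`vision_config` / `text_config` attributes):

```python
from transformers import AutoConfig

# Inspect the composite LLaVA config: a CLIP vision tower plus a Llama text backbone.
cfg = AutoConfig.from_pretrained("Seungyoun/llava-llama-3-8b-hf")
print(cfg.model_type)                                               # expected: llava
print(cfg.vision_config.model_type, cfg.vision_config.image_size)  # CLIP vision tower, 336-px input
print(cfg.text_config.model_type, cfg.text_config.hidden_size)     # Llama backbone
```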
## Details
| Model | Visual Encoder | Projector | Resolution | Pretraining Strategy | Fine-tuning Strategy | Pretrain Dataset | Fine-tune Dataset |
| ----- | -------------- | --------- | ---------- | -------------------- | -------------------- | ---------------- | ----------------- |
| LLaVA-v1.5-7B | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, Frozen ViT | LLaVA-PT (558K) | LLaVA-Mix (665K) |
| LLaVA-Llama-3-8B | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, LoRA ViT | LLaVA-PT (558K) | LLaVA-Mix (665K) |
| LLaVA-Llama-3-8B-v1.1 | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, LoRA ViT | ShareGPT4V-PT (1246K) | InternVL-SFT (1268K) |
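For illustration only, the "Full LLM, LoRA ViT" fine-tuning strategy could be expressed with `peft` roughly as below; the rank, dropout, and target-module regex are placeholders and do not reproduce XTuner's actual training configuration:

```python
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained("Seungyoun/llava-llama-3-8b-hf")

# LoRA adapters only on the CLIP vision tower's attention projections (illustrative values).
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=r".*vision_tower.*(q_proj|v_proj)",  # regex over module names
)
model = get_peft_model(model, lora_cfg)

# PEFT freezes everything except the adapters; re-enable full training of the language model.
for name, param in model.named_parameters():
    if "language_model" in name:
        param.requires_grad = True

model.print_trainable_parameters()
```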
## Results
| Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar |
| ----- | ----------------- | ----------------- | ----------- | -------- | -------- | --------- | -------------- | ------------------- | ---- | --- | ------- | --- | ------ |
| LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 |
| LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 |
| LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 |
## Citation
```bibtex
@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished = {\url{https://github.com/InternLM/xtuner}},
    year={2023}
}
```