---
license: llama3
library_name: xtuner
datasets:
- Lin-Chen/ShareGPT4V
pipeline_tag: image-text-to-text
---

---

**Notice:** This repository hosts the [`xtuner/llava-llama-3-8b-v1_1-hf`](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf) model, modified so that it loads with the plain `transformers` library. Only the model configuration and index files were manually adjusted; the model weights are unchanged.

---

## QuickStart

Run the model with the `transformers` library alone:

```python
from transformers import (
    LlavaProcessor,
    LlavaForConditionalGeneration,
)
from PIL import Image
import requests

MODEL_NAME = "Seungyoun/llava-llama-3-8b-hf"

processor = LlavaProcessor.from_pretrained(MODEL_NAME)
processor.tokenizer.add_tokens(
    ["<|image|>", "<pad>"], special_tokens=True
)  # register <|image|> and <pad> as special tokens

model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda:0")
model.resize_token_embeddings(
    len(processor.tokenizer)
)  # resize embeddings for new tokens


# prepare image and text prompt, using the appropriate prompt template
url = "https://upload.wikimedia.org/wikipedia/commons/1/18/Kochendes_wasser02.jpg"
image = Image.open(requests.get(url, stream=True).raw)

template = """<|start_header_id|>system<|end_header_id|>{system_prompt}<|eot_id|>
<|start_header_id|>user<|end_header_id|>{user_msg_1}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

terminators = [
    processor.tokenizer.eos_token_id,
    processor.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

prompt = template.format(
    system_prompt="As a vision-llm, your task is to analyze and describe the contents of the image presented to you. Examine the photograph closely and provide a comprehensive, detailed caption. You should identify and describe the various food items and their arrangement, as well as any discernible textures, colors, and specific features of the containers they are in. Highlight the variety and how these contribute to the overall visual appeal of the meal. Your description should help someone who cannot see the image to visualize its contents accurately.",
    user_msg_1="<|image|>\nGive me detailed description of the image.",
)

# keyword arguments avoid depending on the positional order of text and images
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=terminators)

print(processor.decode(output[0], skip_special_tokens=False))
# Generated caption (the printed text also contains the prompt and special tokens):
# The image captures a moment in a kitchen. The main focus is a white electric kettle, which is plugged in and resting on a black stovetop. The stovetop has four burners, although only one is occupied by the kettle. The background is blurred, drawing attention to the kettle and stovetop. The image does not contain any text or additional objects. The relative position of the objects is such that the kettle is on the stovetop, and the background is blurred.
```
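
The prompt template and the decoding step above can be wrapped in two small string helpers. The following is a minimal sketch, not part of `transformers` or XTuner: the names `build_prompt` and `extract_reply` are hypothetical, and it assumes the single-turn template and the `<|image|>` placeholder shown in the QuickStart.

```python
# Hypothetical helpers (not part of transformers or XTuner) that mirror the
# single-turn prompt template used in the QuickStart.
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def build_prompt(user_msg, system_prompt="", image_token="<|image|>"):
    """Format one system turn and one user turn, then open the assistant turn."""
    # The image placeholder goes on its own line before the user message,
    # as in the QuickStart example.
    user_content = f"{image_token}\n{user_msg}"
    return (
        f"<|start_header_id|>system<|end_header_id|>{system_prompt}{END_OF_TURN}\n"
        f"<|start_header_id|>user<|end_header_id|>{user_content}{END_OF_TURN}\n"
        f"{ASSISTANT_HEADER}"
    )


def extract_reply(decoded):
    """Return the text after the last assistant header, without <|eot_id|>."""
    reply = decoded.rsplit(ASSISTANT_HEADER, 1)[-1]
    return reply.replace(END_OF_TURN, "").strip()


prompt = build_prompt(
    "Give me detailed description of the image.",
    system_prompt="You are a helpful vision assistant.",
)
print(prompt)
```

With these helpers, `extract_reply(processor.decode(output[0], skip_special_tokens=False))` would return only the generated caption, assuming the model ends its turn with `<|eot_id|>`.
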
---


## Model

llava-llama-3-8b-v1_1-hf is a LLaVA model fine-tuned from [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and [CLIP-ViT-Large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) with [ShareGPT4V-PT](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) and [InternVL-SFT](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets) by [XTuner](https://github.com/InternLM/xtuner).


## Details

| Model                 | Visual Encoder | Projector | Resolution |   Pretraining Strategy | Fine-tuning Strategy |      Pretrain Dataset |    Fine-tune Dataset |
| :-------------------- | -------------: | --------: | ---------: | ---------------------: | -------------------: | --------------------: | -------------------: |
| LLaVA-v1.5-7B         |         CLIP-L |       MLP |        336 | Frozen LLM, Frozen ViT | Full LLM, Frozen ViT |       LLaVA-PT (558K) |     LLaVA-Mix (665K) |
| LLaVA-Llama-3-8B      |         CLIP-L |       MLP |        336 | Frozen LLM, Frozen ViT |   Full LLM, LoRA ViT |       LLaVA-PT (558K) |     LLaVA-Mix (665K) |
| LLaVA-Llama-3-8B-v1.1 |         CLIP-L |       MLP |        336 | Frozen LLM, Frozen ViT |   Full LLM, LoRA ViT | ShareGPT4V-PT (1246K) | InternVL-SFT (1268K) |

## Results

<div  align="center">
<img src="https://github.com/InternLM/xtuner/assets/36994684/a157638c-3500-44ed-bfab-d8d8249f91bb" alt="Image" width="500" />
</div>

| Model                 | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU  Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA  | TextVQA |   MME    | MMStar |
| :-------------------- | :---------------: | :---------------: | :---------: | :-------: | :------: | :-------: | :------------: | :-----------------: | :--: | :--: | :-----: | :------: | :----: |
| LLaVA-v1.5-7B         |       66.5        |       59.0        |    27.5     |   35.3    |   60.5   |   54.8    |      70.4      |        44.9         | 85.9 | 62.0 |  58.2   | 1511/348 |  30.3  |
| LLaVA-Llama-3-8B      |       68.9        |       61.6        |    30.4     |   36.8    |   69.8   |   60.9    |      73.3      |        47.3         | 87.2 | 63.5 |  58.0   | 1506/295 |  38.2  |
| LLaVA-Llama-3-8B-v1.1 |       72.3        |       66.4        |    31.6     |   36.8    |   70.1   |   70.0    |      72.9      |        47.7         | 86.4 | 62.6 |  59.0   | 1469/349 |  45.1  |
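
To see where LLaVA-Llama-3-8B-v1.1 differs most from LLaVA-v1.5-7B, the per-benchmark differences can be computed from the table above. This is only an illustrative sketch: the scores are copied from the table, MME is left out because it is reported as a pair of scores, and the function name `score_deltas` is a hypothetical helper.

```python
# Scores copied from the results table; MME is omitted because it is
# reported as a pair of scores rather than a single number.
BENCHMARKS = [
    "MMBench Test (EN)", "MMBench Test (CN)", "CCBench Dev", "MMMU Val",
    "SEED-IMG", "AI2D Test", "ScienceQA Test", "HallusionBench aAcc",
    "POPE", "GQA", "TextVQA", "MMStar",
]
LLAVA_V15_7B = [66.5, 59.0, 27.5, 35.3, 60.5, 54.8, 70.4, 44.9, 85.9, 62.0, 58.2, 30.3]
LLAVA_LLAMA3_8B_V11 = [72.3, 66.4, 31.6, 36.8, 70.1, 70.0, 72.9, 47.7, 86.4, 62.6, 59.0, 45.1]


def score_deltas(baseline, candidate):
    """Return candidate minus baseline per benchmark, rounded to one decimal."""
    return {
        name: round(c - b, 1)
        for name, b, c in zip(BENCHMARKS, baseline, candidate)
    }


deltas = score_deltas(LLAVA_V15_7B, LLAVA_LLAMA3_8B_V11)

# Print the benchmarks from largest to smallest difference.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {delta:+.1f}")
```
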


## Citation

```bibtex
@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished = {\url{https://github.com/InternLM/xtuner}},
    year={2023}
}
```