File size: 5,591 Bytes
4dbf61a a39ec3e 4dbf61a 9076059 d9c4295 0cfb71e 4dbf61a dcc76c9 9076059 dcc76c9 9076059 c509a63 9076059 5cfc077 c509a63 c2e74c6 c509a63 5cfc077 c509a63 d9c4295 2f7f20b d9c4295 c509a63 216670a c509a63 9076059 d7f9c48 9076059 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 |
---
license: apache-2.0
tags:
- vision
- image-text-to-text
language:
- en
pipeline_tag: image-text-to-text
inference: true
---
# LLaVa-Next, leveraging [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co./mistralai/Mistral-7B-Instruct-v0.2) as LLM
The LLaVA-NeXT model was proposed in [LLaVA-NeXT: Improved reasoning, OCR, and world knowledge](https://llava-vl.github.io/blog/2024-01-30-llava-next/) by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee. LLaVa-NeXT (also called LLaVa-1.6) improves upon [LLaVa-1.5](https://huggingface.co./transformers/main/model_doc/llava.html) by increasing the input image resolution and training on an improved visual instruction tuning dataset to improve OCR and common sense reasoning.
Disclaimer: The team releasing LLaVa-NeXT did not write a model card for this model so this model card has been written by the Hugging Face team.
## Model description
LLaVa combines a pre-trained large language model with a pre-trained vision encoder for multimodal chatbot use cases. LLaVA 1.6 improves on LLaVA 1.5 BY:
- Using [Mistral-7B](https://mistral.ai/news/announcing-mistral-7b/) (for this checkpoint) and [Nous-Hermes-2-Yi-34B](https://huggingface.co./NousResearch/Nous-Hermes-2-Yi-34B) which has better commercial licenses,
and bilingual support
- More diverse and high quality data mixture
- Dynamic high resolution
![image/png](https://cdn-uploads.huggingface.co/production/uploads/62441d1d9fdefb55a0b7d12c/FPshq08TKYD0e-qwPLDVO.png)
## Intended uses & limitations
You can use the raw model for tasks like image captioning, visual question answering, multimodal chatbot use cases. See the [model hub](https://huggingface.co./models?search=llava-hf) to look for
other versions on a task that interests you.
### How to use
Here's the prompt template for this model but we recomment to use the chat templates to format the prompt with `processor.apply_chat_template()`.
That will apply the correct template for a given checkpoint for you.
```
"[INST] <image>\nWhat is shown in this image? [/INST]"
```
To run the model with the `pipeline`, see the below example:
```python
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="llava-hf/llava-v1.6-mistral-7b-hf")
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co./datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"},
{"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
],
},
]
out = pipe(text=messages, max_new_tokens=20)
print(out)
>>> [{'input_text': [{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co./datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg'}, {'type': 'text', 'text': 'What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud'}]}], 'generated_text': 'Lava'}]
```
You can also load and use the model like following:
```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to("cuda:0")
# prepare image and text prompt, using the appropriate prompt template
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "What is shown in this image?"},
{"type": "image"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```
### Model optimization
#### 4-bit quantization through `bitsandbytes` library
First make sure to install `bitsandbytes`, `pip install bitsandbytes` and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with:
```diff
model = LlavaNextForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
+ load_in_4bit=True
)
```
#### Use Flash-Attention 2 to further speed-up generation
First make sure to install `flash-attn`. Refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) regarding that package installation. Simply change the snippet above with:
```diff
model = LlavaNextForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
+ use_flash_attention_2=True
).to(0)
```
### BibTeX entry and citation info
```bibtex
@misc{liu2023improved,
title={Improved Baselines with Visual Instruction Tuning},
author={Haotian Liu and Chunyuan Li and Yuheng Li and Yong Jae Lee},
year={2023},
eprint={2310.03744},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
``` |