---
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
language:
- en
tags:
- llava
- phi
license: mit
library_name: transformers
widget:
- text: "What animal is it?"
  src: "https://huggingface.co./datasets/mishig/sample_images/resolve/main/tiger.jpg"
- text: "Where is it?"
  src: "https://huggingface.co./datasets/mishig/sample_images/resolve/main/palace.jpg"
---

# LLaVA-3b

<a target="_blank" href="https://colab.research.google.com/drive/1W7JQrFXwFunAY1XvS31mwC7mrXBgGD_M">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Model details

LLaVA-3b is a model fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co./cognitivecomputations/dolphin-2_6-phi-2) in the LLaVA fashion, using the vision tower from
[SigLIP 400M](https://huggingface.co./timm/ViT-SO400M-14-SigLIP-384). It differs from the original LLaVA setup in a few ways:

1. Multiple image tokens. The multimodal projector generates embeddings of shape [5, 2560] instead of [1, 2560] for each image. The idea is that more tokens
   carry more information from the image into the language model.
2. The model uses the output of the last layer of the vision encoder rather than an intermediate layer.
3. The context length during training was 1200 tokens, because the L4 GPUs I used could not fit more.

Like Dolphin 2.6 Phi, LLaVA-3b uses the ChatML prompt format:

```
<|im_start|>system
You are Dolphin, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
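A prompt in this format can also be assembled programmatically. The helper below is a minimal sketch: `build_chatml_prompt` is a hypothetical name that is not part of this repository, and it assumes a single-turn conversation that follows the template above, with the `<image>` placeholder placed at the start of the user message.

```python
DEFAULT_SYSTEM = "You are Dolphin, a helpful AI assistant."


def build_chatml_prompt(user_message: str, system_message: str = DEFAULT_SYSTEM) -> str:
    """Assemble a single-turn ChatML prompt (hypothetical helper, not part of the repo)."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        # The prompt ends with an open assistant turn for the model to complete.
        "<|im_start|>assistant\n"
    )


# The <image> placeholder marks where the image goes in the user turn.
prompt = build_chatml_prompt("<image>\nDescribe the image.")
print(prompt)
```

Keeping the template in one function makes it easier to change the system message without retyping the special tokens by hand.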

## How to use

**Install dependencies**

```bash
pip install -q open_clip_torch timm einops
```

**Download modeling files**

```python
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)
```

**Create a model**

```python
from modeling_llava import LlavaForConditionalGeneration
import torch

model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b", torch_dtype=torch.float16)
model = model.to("cuda")
```

**Create processors**

```python
from transformers import AutoTokenizer
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)
```

**Set image and text**

```python
from PIL import Image
import requests

image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""
```

**Process inputs**

```python
inputs = processor(prompt, raw_image, model, return_tensors='pt')

inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
```

**Generate the answer**

```python
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.5, temperature=1.2, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
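If the generated ids are decoded with special tokens kept (for instance `tokenizer.decode(output[0])` without `skip_special_tokens`), the text may contain ChatML markers and possibly the prompt itself. The snippet below is a minimal sketch of pulling out only the assistant's turn. `extract_assistant_reply` is a hypothetical helper, not part of this repository, and the sample string is made up for illustration rather than real model output.

```python
ASSISTANT_HEADER = "<|im_start|>assistant\n"
END_OF_TURN = "<|im_end|>"


def extract_assistant_reply(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded ChatML string."""
    # rpartition returns the whole string as the tail when the header is absent,
    # so this also works if the decoded text contains only the reply.
    _, _, tail = decoded.rpartition(ASSISTANT_HEADER)
    # Cut at the end-of-turn marker, if the model emitted one.
    return tail.split(END_OF_TURN, 1)[0].strip()


# Made-up example of decoded text, for illustration only.
sample = (
    "<|im_start|>user\n<image>\nDescribe the image.<|im_end|>\n"
    "<|im_start|>assistant\nA lake surrounded by mountains.<|im_end|>"
)
reply = extract_assistant_reply(sample)
print(reply)
```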

## Benchmarks

- TextVQA - 33.25%
- GQA - 47.15%
- VQAv2 - 63.1%
- VizWiz - 24.03%

## Acknowledgments

Thanks to [ML Collective](https://mlcollective.org/) for providing credits for computing resources.