---
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
language:
- en
tags:
- llava
- phi
---

# LLaVA-3b Model Card

## Model details

LLaVA-3b is a model fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) in the LLaVA fashion, using the vision tower from
[SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384). There are a couple of differences from the original LLaVA architecture:

1. Multiple image tokens. The multimodal projector generates embeddings of shape [5, 2560] instead of [1, 2560] per image. The idea is that more tokens
allow more information from the image to reach the language model (a rough sketch of such a projector follows this list).
2. The model uses the output of the last layer of the vision encoder instead of an intermediate one.
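
Below is a minimal, illustrative sketch of what such a multi-token projector can look like. It is not the actual LLaVA-3b implementation: the class name, the pooling strategy, and the vision-side dimensions are assumptions; only the output shape [5, 2560] comes from the description above.

```
# Illustrative sketch only, NOT the actual LLaVA-3b projector. The pooling
# strategy and the vision feature width (1152) are assumptions.
import torch
import torch.nn as nn

class MultiTokenProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, text_dim: int = 2560, num_tokens: int = 5):
        super().__init__()
        # Collapse the patch sequence into `num_tokens` slots, then project
        # each slot into the language model's embedding space.
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: [batch, num_patches, vision_dim], taken from the
        # last layer of the vision encoder.
        x = self.pool(patch_features.transpose(1, 2)).transpose(1, 2)  # [batch, num_tokens, vision_dim]
        return self.proj(x)  # [batch, num_tokens, text_dim]

# Dummy patch features (patch count and width assumed for illustration).
features = torch.randn(1, 729, 1152)
print(MultiTokenProjector()(features).shape)  # torch.Size([1, 5, 2560])
```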

Like Dolphin 2.6 Phi, LLaVA-3b uses the ChatML prompt format:

```
<|im_start|>system
You are Dolphin, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

## How to use

**Install dependencies**

The snippets below additionally rely on `torch`, `transformers`, `huggingface_hub`, `Pillow`, and `requests` being installed.

```
!pip install -q open_clip_torch timm einops
```

**Download modeling files**

```
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)
```

**Create a model**

```
from modeling_llava import LlavaForConditionalGeneration
import torch

model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b", torch_dtype=torch.float16)
model = model.to("cuda")
```

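If you don't have a CUDA GPU, the same setup should in principle run on the CPU, only much more slowly; this is an assumption about the custom modeling code rather than something the card states. Float16 is poorly supported on CPU, so keep float32:

```
# Assumed CPU-only variant of the setup above: load in float32 and keep the
# model on the CPU. Expect generation to be much slower than on a GPU.
model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b", torch_dtype=torch.float32)
model = model.to("cpu")
```
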
**Create processors**

```
from transformers import AutoTokenizer
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)
```

**Set image and text**

```
from PIL import Image
import requests

image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""
```

**Process inputs**

```
inputs = processor(prompt, raw_image, model, return_tensors='pt')

inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
```

**Generate the data**

```
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.5, temperature=1.2, eos_token_id=tokenizer.eos_token_id)
```

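The call above returns token IDs. To read the model's answer, decode them with the tokenizer; the sketch below assumes standard `transformers` `generate` semantics, where the returned sequence includes the prompt tokens:

```
# Decode the returned sequence. With Hugging Face-style generate, the prompt
# tokens are included, so the printed text is the prompt followed by the answer.
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
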
## License

This model is based on Phi-2 and is governed by Microsoft's microsoft-research-license, which prohibits commercial use.

**Where to send questions or comments about the model:**
https://twitter.com/visheratin