---
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
language:
- en
tags:
- llava
- phi
---

# LLaVA-3b Model Card

## Model details

LLaVA-3b is a model fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) in the LLaVA fashion, using the vision tower from
[SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384). There are a couple of differences from the original LLaVA architecture:

1. Multiple image tokens. The multimodal projector generates embeddings of shape [5, 2560] instead of [1, 2560] per image. The idea is that more tokens
allow more information from the image to reach the language model (a rough sketch of such a projector follows this list).
2. The model uses the output of the last layer of the vision encoder instead of an intermediate one.
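
Below is a minimal, illustrative sketch of what such a multi-token projector can look like. It is not the actual LLaVA-3b implementation: the class name, the pooling strategy, and the vision-side dimensions are assumptions; only the output shape [5, 2560] comes from the description above.

```
# Illustrative sketch only, NOT the actual LLaVA-3b projector. The pooling
# strategy and the vision feature width (1152) are assumptions.
import torch
import torch.nn as nn

class MultiTokenProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, text_dim: int = 2560, num_tokens: int = 5):
        super().__init__()
        # Collapse the patch sequence into `num_tokens` slots, then project
        # each slot into the language model's embedding space.
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: [batch, num_patches, vision_dim], taken from the
        # last layer of the vision encoder.
        x = self.pool(patch_features.transpose(1, 2)).transpose(1, 2)  # [batch, num_tokens, vision_dim]
        return self.proj(x)  # [batch, num_tokens, text_dim]

# Dummy patch features (patch count and width assumed for illustration).
features = torch.randn(1, 729, 1152)
print(MultiTokenProjector()(features).shape)  # torch.Size([1, 5, 2560])
```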

Like Dolphin 2.6 Phi, LLaVA-3b uses the ChatML prompt format:

```
<|im_start|>system
You are Dolphin, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

## How to use

**Install dependencies**

The snippets below additionally rely on `torch`, `transformers`, `huggingface_hub`, `Pillow`, and `requests` being installed.

```
!pip install -q open_clip_torch timm einops
```

**Download modeling files**

```
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)
```

**Create a model**

```
from modeling_llava import LlavaForConditionalGeneration
import torch

model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b", torch_dtype=torch.float16)
model = model.to("cuda")
```

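If you don't have a CUDA GPU, the same setup should in principle run on the CPU, only much more slowly; this is an assumption about the custom modeling code rather than something the card states. Float16 is poorly supported on CPU, so keep float32:

```
# Assumed CPU-only variant of the setup above: load in float32 and keep the
# model on the CPU. Expect generation to be much slower than on a GPU.
model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b", torch_dtype=torch.float32)
model = model.to("cpu")
```
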
**Create processors**

```
from transformers import AutoTokenizer
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)
```

**Set image and text**

```
from PIL import Image
import requests

image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""
```

**Process inputs**

```
inputs = processor(prompt, raw_image, model, return_tensors='pt')

inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
```

**Generate the data**

```
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.5, temperature=1.2, eos_token_id=tokenizer.eos_token_id)
```

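The call above returns token IDs. To read the model's answer, decode them with the tokenizer; the sketch below assumes standard `transformers` `generate` semantics, where the returned sequence includes the prompt tokens:

```
# Decode the returned sequence. With Hugging Face-style generate, the prompt
# tokens are included, so the printed text is the prompt followed by the answer.
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
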
## License

This model is based on Phi-2 and is governed by Microsoft's microsoft-research-license, which prohibits commercial use.

**Where to send questions or comments about the model:**
https://twitter.com/visheratin