visheratin committed • Commit e8f5de1 • Parent: e9369b2
Update README.md

README.md CHANGED
@@ -24,16 +24,12 @@ widget:

## Model details

-MC-LLaVA-3b was fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) using vision tower from
-[SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384).
-
-The context length during training was 1200 tokens, as the L4 GPUs I used didn't allow me to get more.
+Usually, in LLaVA models, we generate N embeddings for the image, which we then combine with text embeddings and send to the LLM.
+But what if, instead of creating N tokens for one image, we created K<<N tokens for each of M<N parts of the image (crops)?
+This would let us capture visual information from small parts of the image without inflating the number of image "tokens" too much.
+I call this method multi-crop LLaVA (MC-LLaVA).
+
+MC-LLaVA-3b was fine-tuned from a [Phi-2 merge](https://huggingface.co/vince62s/phi-2-psy) using the vision tower from
+[SigLIP 400M](https://huggingface.co/google/siglip-so400m-patch14-384).

Like Dolphin 2.6 Phi, MC-LLaVA-3b uses the ChatML prompt format:

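To make the crop idea above concrete, here is a minimal, hypothetical sketch of how an image could be split into M crops before encoding. The helper name and grid size are illustrative only and are not taken from the model's code; in the released model, cropping is handled by the processor (see the `max_crops` and `num_tokens` arguments in the usage snippet further down).

```python
# Illustration of the multi-crop idea only; not the model's actual implementation.
from PIL import Image


def make_crops(image: Image.Image, grid: int = 3) -> list[Image.Image]:
    """Split an image into a grid x grid set of crops (M = grid * grid)."""
    width, height = image.size
    crop_w, crop_h = width // grid, height // grid
    return [
        image.crop((x * crop_w, y * crop_h, (x + 1) * crop_w, (y + 1) * crop_h))
        for y in range(grid)
        for x in range(grid)
    ]

# Each crop would then be encoded by the vision tower into K << N tokens,
# so the LLM sees roughly M * K image "tokens" instead of N for one full image.
```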
@@ -47,91 +43,30 @@ You are Dolphin, a helpful AI assistant.<|im_end|>

## How to use

-**Install dependencies**
-
-```bash
-!pip install -q open_clip_torch timm einops
-```
-
-**Download modeling files**
-
-```python
-from huggingface_hub import hf_hub_download
-
-hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
-hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
-hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
-hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
-hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)
-```
-
-**Create a model**
-
-```python
-from modeling_llava import LlavaForConditionalGeneration
-import torch
-
-model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b")
-model = model.to("cuda")
-```
-
-**Create processors**
-
-```python
-from transformers import AutoTokenizer
-from processing_llava import LlavaProcessor, OpenCLIPImageProcessor
-
-tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")
-image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
-processor = LlavaProcessor(image_processor, tokenizer)
-```
-
-**Set image and text**
-
-```python
-from PIL import Image
-import requests
-
-raw_image = Image.open(requests.get(image_file, stream=True).raw)
-
-prompt = """<|im_start|>system
-The assistant gives helpful, detailed, and polite answers to the human's questions.
-The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
-<|im_start|>user
-<image>
-Describe the image.<|im_end|>
-<|im_start|>assistant
-"""
-```
-
-**Process inputs**
-
-```python
-inputs = processor(prompt, raw_image, model, return_tensors='pt')
-
-inputs['input_ids'] = inputs['input_ids'].to(model.device)
-inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
-```
-
-**Generate the data**
-
-```python
-output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.4, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
-```
+```python
+from transformers import AutoModel, AutoProcessor
+import torch
+
+model = AutoModel.from_pretrained("visheratin/MC-LLaVA-3b", torch_dtype=torch.float16, trust_remote_code=True).to("cuda")
+processor = AutoProcessor.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)
+
+with torch.inference_mode():
+    inputs = processor(prompt, [raw_image], model, max_crops=100, num_tokens=728)
+    output = model.generate(**inputs, max_new_tokens=200, use_cache=True, do_sample=False, eos_token_id=processor.tokenizer.eos_token_id, pad_token_id=processor.tokenizer.eos_token_id)
+
+result = processor.tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", "")
+print(result)
+```

## Benchmarks

-- TextVQA -
-- GQA -
-- VQAv2 -
-- VizWiz -
-
-- V*-bench - 52.25% (OCR - 46.66%, GPT4V-hard - 41.17%, direct attributes - 43.48%, relative position - 65.79%)
+- TextVQA - 50.9%
+- GQA - 59.5%
+- VQAv2 - 76.72%
+- VizWiz - 32.68%
+- V*-bench - OCR - 56.66%, GPT4V-hard - 52.94%, direct attributes - 40.86%, relative position - 56.57%

## Examples

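The new usage snippet assumes that `prompt` and `raw_image` are already defined. A minimal setup for them, based on the image-loading code and ChatML prompt shown in the previous version of the README (the image URL here is a placeholder):

```python
# Example inputs for the usage snippet above; the image URL is a placeholder.
from PIL import Image
import requests

image_url = "https://example.com/image.jpg"  # replace with a real image URL
raw_image = Image.open(requests.get(image_url, stream=True).raw)

# ChatML-formatted prompt with an <image> placeholder, as in the README's example.
prompt = """<|im_start|>system
You are Dolphin, a helpful AI assistant.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""
```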
@@ -146,4 +81,6 @@ Which means don't create competitor models for them.

## Acknowledgments

-Thanks to [
+Thanks to [Lambda](https://lambdalabs.com/) for providing a machine to train the model.
+
+Thanks to [ML Collective](https://mlcollective.org/) for continuous support and providing compute resources for testing the model.