weizhiwang committed · verified
Commit b70c5a1 · 1 Parent(s): 194526c

Update README.md

Files changed (1):
  1. README.md (+13, -18)
README.md CHANGED
@@ -7,15 +7,11 @@ language:
  - en
  ---
 
- # Model Card for LLaVA-Video-LLaMA-3
+ # Model Card for LaViA-Llama-3-8b
 
  <!-- Provide a quick summary of what the model is/does. -->
 
- Please follow my github repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3/) for more details on fine-tuning LLaVA model with Llama-3 as the foundatiaon LLM.
-
- ## Updates
- - [6/4/2024] The codebase supports the video data fine-tuning for video understanding tasks.
- - [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). Now it supports the latest llama-3, phi-3, mistral-v0.1-7b models.
+ Please follow my GitHub repo [LaViA](https://github.com/Victorwz/LaViA) for more details on fine-tuning the LaViA model with Llama-3 as the foundation LLM.
 
  ## Model Details
  - Video Frame Sampling: Since we adopt CLIP-ViT-L-336px as the image encoder (576 tokens per image) and the context window of LLaMA-3 is 8k, the video frame sampling rate is set to max(30, num_frames // 10).
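
(Aside, not part of the commit diff: the sampling rule in the Model Details bullet above can be read as keeping one frame out of every max(30, num_frames // 10) frames, so roughly ten frames are encoded per video. Below is a minimal sketch of that interpretation; it assumes the `decord` library for decoding and is not the repo's own loading code.)

```python
# Illustrative sketch only (not from the commit): sample video frames with a stride
# of max(30, num_frames // 10), which keeps about ten frames per video so that
# 576 image tokens per frame still fit inside LLaMA-3's 8k context window.
# Assumes the `decord` package; the repo's own inference code may load video differently.
import numpy as np
from decord import VideoReader, cpu

def sample_video_frames(video_path: str) -> np.ndarray:
    vr = VideoReader(video_path, ctx=cpu(0))      # decode the video on CPU
    num_frames = len(vr)                          # total number of decoded frames
    stride = max(30, num_frames // 10)            # sampling rule from the model card
    indices = list(range(0, num_frames, stride))  # indices of the frames to keep
    return vr.get_batch(indices).asnumpy()        # (n_kept, H, W, 3) uint8 array
```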
@@ -24,9 +20,11 @@ Please follow my github repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/L
 
  ## How to Use
 
- Please firstly install llava via
+ Please first install LaViA via
  ```
- pip install git+https://github.com/Victorwz/LLaVA-Video-Llama-3.git
+ git clone https://github.com/Victorwz/LaViA
+ cd LaViA-video-sft
+ pip install -e ./
  ```
 
  You can load the model and perform inference as follows:
@@ -45,8 +43,8 @@ import numpy as np
 
  # load model and processor
  device = "cuda" if torch.cuda.is_available() else "cpu"
- model_name = get_model_name_from_path("weizhiwang/LLaVA-Video-Llama-3")
- tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LLaVA-Video-Llama-3", None, model_name, False, False, device=device)
+ model_name = get_model_name_from_path("weizhiwang/LaViA-Llama-38b")
+ tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LaViA-Llama-38b", None, model_name, False, False, device=device)
 
  # prepare image input
  url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"
@@ -109,16 +107,13 @@ The image caption results look like:
  The video is funny because it shows a baby girl wearing glasses and reading a book, which is an unusual and amusing sight. It is not common to see a baby wearing glasses and engaging in a reading activity, as they are still developing their motor skills and cognitive abilities. The image captures a cute and endearing moment, as the baby appears to be enjoying her time and learning to read. This scene can evoke a sense of warmth and delight in the viewer, as it showcases the innocence and curiosity of childhood.
  ```
 
- # Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data
- Please refer to a forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3) git repo for fine-tuning data preparation and scripts. The data loading function and fastchat conversation template are changed due to a different tokenizer.
-
-
  ## Citation
 
  ```bibtex
- @misc{wang2024llavavideollama3,
- title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
- author={Wang, Weizhi},
- year={2024}
+ @misc{wang2024LaViA,
+ title={LaViA: Fine-Tuning Multimodal LLMs as Task Assistants with Video Instructions},
+ url={https://github.com/Victorwz/LaViA},
+ author={Wang, Weizhi and Luo, Xuan and Yan, Xifeng},
+ year={2024},
  }
  ```