weizhiwang committed · verified
Commit b70c5a1 · 1 Parent(s): 194526c

Update README.md

Files changed (1):
  1. README.md (+13, -18)
README.md CHANGED
@@ -7,15 +7,11 @@ language:
  - en
  ---
 
- # Model Card for LLaVA-Video-LLaMA-3
+ # Model Card for LaViA-Llama-3-8b
 
  <!-- Provide a quick summary of what the model is/does. -->
 
- Please follow my github repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3/) for more details on fine-tuning LLaVA model with Llama-3 as the foundatiaon LLM.
-
- ## Updates
- - [6/4/2024] The codebase supports the video data fine-tuning for video understanding tasks.
- - [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). Now it supports the latest llama-3, phi-3, mistral-v0.1-7b models.
+ Please follow my GitHub repo [LaViA](https://github.com/Victorwz/LaViA) for more details on fine-tuning the LaViA model with Llama-3 as the foundation LLM.
 
  ## Model Details
  - Video Frame Sampling: Since we adopt CLIP-ViT-L-336px as the image encoder (576 tokens per image) and the context window of LLaMA-3 is 8k, the video frame sampling rate is set to max(30, num_frames // 10).
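
(Aside, not part of the commit diff: the sampling rule in the Model Details bullet above can be read as keeping one frame out of every max(30, num_frames // 10) frames, so roughly ten frames are encoded per video. Below is a minimal sketch of that interpretation; it assumes the `decord` library for decoding and is not the repo's own loading code.)

```python
# Illustrative sketch only (not from the commit): sample video frames with a stride
# of max(30, num_frames // 10), which keeps about ten frames per video so that
# 576 image tokens per frame still fit inside LLaMA-3's 8k context window.
# Assumes the `decord` package; the repo's own inference code may load video differently.
import numpy as np
from decord import VideoReader, cpu

def sample_video_frames(video_path: str) -> np.ndarray:
    vr = VideoReader(video_path, ctx=cpu(0))      # decode the video on CPU
    num_frames = len(vr)                          # total number of decoded frames
    stride = max(30, num_frames // 10)            # sampling rule from the model card
    indices = list(range(0, num_frames, stride))  # indices of the frames to keep
    return vr.get_batch(indices).asnumpy()        # (n_kept, H, W, 3) uint8 array
```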
@@ -24,9 +20,11 @@ Please follow my github repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/L
 
  ## How to Use
 
- Please firstly install llava via
+ Please first install LaViA via
  ```
- pip install git+https://github.com/Victorwz/LLaVA-Video-Llama-3.git
+ git clone https://github.com/Victorwz/LaViA
+ cd LaViA-video-sft
+ pip install -e ./
  ```
 
  You can load the model and perform inference as follows:
@@ -45,8 +43,8 @@ import numpy as np
 
  # load model and processor
  device = "cuda" if torch.cuda.is_available() else "cpu"
- model_name = get_model_name_from_path("weizhiwang/LLaVA-Video-Llama-3")
- tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LLaVA-Video-Llama-3", None, model_name, False, False, device=device)
+ model_name = get_model_name_from_path("weizhiwang/LaViA-Llama-38b")
+ tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LaViA-Llama-38b", None, model_name, False, False, device=device)
 
  # prepare image input
  url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"
@@ -109,16 +107,13 @@ The image caption results look like:
  The video is funny because it shows a baby girl wearing glasses and reading a book, which is an unusual and amusing sight. It is not common to see a baby wearing glasses and engaging in a reading activity, as they are still developing their motor skills and cognitive abilities. The image captures a cute and endearing moment, as the baby appears to be enjoying her time and learning to read. This scene can evoke a sense of warmth and delight in the viewer, as it showcases the innocence and curiosity of childhood.
  ```
 
- # Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data
- Please refer to a forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3) git repo for fine-tuning data preparation and scripts. The data loading function and fastchat conversation template are changed due to a different tokenizer.
-
-
  ## Citation
 
  ```bibtex
- @misc{wang2024llavavideollama3,
- title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
- author={Wang, Weizhi},
- year={2024}
+ @misc{wang2024LaViA,
+ title={LaViA: Fine-Tuning Multimodal LLMs as Task Assistants with Video Instructions},
+ url={https://github.com/Victorwz/LaViA},
+ author={Wang, Weizhi and Luo, Xuan and Yan, Xifeng},
+ year={2024},
  }
  ```