Update README.md

README.md CHANGED

@@ -7,15 +7,11 @@ language:
 - en
 ---
 
-# Model Card for
+# Model Card for LaViA-Llama-3-8b
 
 <!-- Provide a quick summary of what the model is/does. -->
 
-Please follow my github repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3)
-
-## Updates
-- [6/4/2024] The codebase supports the video data fine-tuning for video understanding tasks.
-- [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). Now it supports the latest llama-3, phi-3, mistral-v0.1-7b models.
+Please follow my github repo [LaViA](https://github.com/Victorwz/LaViA) for more details on fine-tuning the LaViA model with Llama-3 as the foundation LLM.
 
 ## Model Details
 - Video Frame Sampling: Considering we adopt CLIP-ViT-L-336px as the image encoder (576 tokens for one image) and the context window of LLaMA-3 is 8k, the video frame sampling rate is set as max(30, num_frames//10).

@@ -24,9 +20,11 @@ Please follow my github repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/L
 
 ## How to Use
 
-Please firstly install
+Please first install lavia via
 ```
-
+git clone https://github.com/Victorwz/LaViA
+cd LaViA-video-sft
+pip install -e ./
 ```
 
 You can load the model and perform inference as follows:

@@ -45,8 +43,8 @@ import numpy as np
 
 # load model and processor
 device = "cuda" if torch.cuda.is_available() else "cpu"
-model_name = get_model_name_from_path("weizhiwang/
-tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/
+model_name = get_model_name_from_path("weizhiwang/LaViA-Llama-38b")
+tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LaViA-Llama-38b", None, model_name, False, False, device=device)
 
 # prepare image input
 url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"

@@ -109,16 +107,13 @@ The image caption results look like:
 The video is funny because it shows a baby girl wearing glasses and reading a book, which is an unusual and amusing sight. It is not common to see a baby wearing glasses and engaging in a reading activity, as they are still developing their motor skills and cognitive abilities. The image captures a cute and endearing moment, as the baby appears to be enjoying her time and learning to read. This scene can evoke a sense of warmth and delight in the viewer, as it showcases the innocence and curiosity of childhood.
 ```
 
-# Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data
-Please refer to a forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3) git repo for fine-tuning data preparation and scripts. The data loading function and fastchat conversation template are changed due to a different tokenizer.
-
-
 ## Citation
 
 ```bibtex
-@misc{
-
-
-
+@misc{wang2024LaViA,
+  title={LaViA: Fine-Tuning Multimodal LLMs as Task Assistants with Video Instructions},
+  url={https://github.com/Victorwz/LaViA},
+  author={Wang, Weizhi and Luo, Xuan and Yan, Xifeng},
+  year={2024},
 }
 ```
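The frame-sampling rule quoted in Model Details, max(30, num_frames//10), can be sketched as below, reading the "rate" as a sampling stride over the clip's frames. This is a minimal illustration of the stated formula only; the function name and return format are assumptions, not the repo's actual API:

```python
def sample_frame_indices(num_frames: int) -> list[int]:
    # Sampling stride from the model card: at least every 30th frame,
    # growing with clip length so long videos still yield about 10 frames.
    rate = max(30, num_frames // 10)
    # Take every `rate`-th frame index across the clip.
    return list(range(0, num_frames, rate))

# A 900-frame clip is sampled every 90th frame, giving 10 frames.
print(sample_frame_indices(900))
```

Under this reading, a clip contributes at most roughly 10 frames, i.e. about 10 x 576 CLIP-ViT-L-336px tokens, which fits within the 8k context window of LLaMA-3 mentioned in the card.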