---
license: cc
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
---

# Model Card for LaViA-Llama-3-8b

Please see my GitHub repo [LaViA](https://github.com/Victorwz/LaViA) for more details on fine-tuning the LaViA model with Llama-3 as the foundation LLM.

## Model Details

- Video Frame Sampling: Since we adopt CLIP-ViT-L-336px as the image encoder (576 tokens per image) and the context window of Llama-3 is 8k, the video frame sampling interval is set to max(30, num_frames//10).
- Template: We follow the LLaVA-v1 template for constructing the conversation.
- Architecture: LLaVA architecture (visual encoder + MLP adapter + LLM backbone)

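The sampling rule and token budget above can be checked with a little arithmetic. The sketch below is illustrative only: `sample_frame_indices` and `visual_token_count` are hypothetical helper names, not part of the LaViA codebase, and it assumes the interval is applied as a stride starting from frame 0.

```python
TOKENS_PER_FRAME = 576   # CLIP-ViT-L-336px: (336 / 14) ** 2 patches per image
CONTEXT_WINDOW = 8192    # Llama-3 context length ("8k")


def sample_frame_indices(num_frames: int) -> list[int]:
    """Indices of the frames kept under the rule max(30, num_frames // 10).

    Assumption: the interval is used as a stride starting at frame 0.
    """
    interval = max(30, num_frames // 10)
    return list(range(0, num_frames, interval))


def visual_token_count(num_frames: int) -> int:
    """Number of visual tokens the sampled frames contribute to the prompt."""
    return len(sample_frame_indices(num_frames)) * TOKENS_PER_FRAME


if __name__ == "__main__":
    for n in (90, 300, 1005):
        kept = sample_frame_indices(n)
        print(n, "frames ->", len(kept), "sampled,", visual_token_count(n), "visual tokens")
```

Under this stride-based reading, a video never contributes more than 11 frames (6,336 visual tokens), which leaves room for the text prompt and the generated response inside the 8k window.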
## How to Use

First, install LaViA:
```
git clone https://github.com/Victorwz/LaViA
cd LaViA-video-sft
pip install -e ./
```

You can load the model and run inference as follows:
```python
import base64
import io

import cv2
import requests
import torch
from PIL import Image

from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.model.builder import load_pretrained_model

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "weizhiwang/LaViA-Llama-38b"
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name, False, False, device=device)

# prepare video input
url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"

def read_video(video_url):
    # download the video to a temporary local file
    response = requests.get(video_url, stream=True)
    if response.status_code != 200:
        raise RuntimeError("Failed to download video")
    with open("tmp_video.mp4", "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            f.write(chunk)

    # decode every frame and keep it as a base64-encoded JPEG
    video = cv2.VideoCapture("tmp_video.mp4")
    base64Frames = []
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
    video.release()
    print(len(base64Frames), "frames read.")
    return base64Frames

video_frames = read_video(video_url=url)

# sample about 10 evenly spaced frames and preprocess each one
image_tensors = []
sampling_interval = max(1, len(video_frames) // 10)  # avoid a zero step for very short videos
for i in range(0, len(video_frames), sampling_interval):
    rawbytes = base64.b64decode(video_frames[i])
    image = Image.open(io.BytesIO(rawbytes)).convert("RGB")
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0].half().to(device)
    image_tensors.append(image_tensor)

# prepare inputs for the model: one <image> placeholder per sampled frame, then the question
text = "\n".join(['<image>' for _ in range(len(image_tensors))]) + '\n' + "Why is this video funny"
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# -200 is the placeholder index for <image> tokens (IMAGE_TOKEN_INDEX in upstream LLaVA)
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)

# autoregressively generate text
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(outputs[0])
```

The model's response looks like:
```
The video is funny because it shows a baby girl wearing glasses and reading a book, which is an unusual and amusing sight. It is not common to see a baby wearing glasses and engaging in a reading activity, as they are still developing their motor skills and cognitive abilities. The image captures a cute and endearing moment, as the baby appears to be enjoying her time and learning to read. This scene can evoke a sense of warmth and delight in the viewer, as it showcases the innocence and curiosity of childhood.
```

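In the inference example, the text prompt carries one `<image>` placeholder per sampled frame, followed by the question. A minimal sketch of that step as a standalone function, assuming the same newline-separated layout; `build_video_prompt` is a hypothetical helper name, not a LaViA API.

```python
IMAGE_PLACEHOLDER = "<image>"


def build_video_prompt(num_frames: int, question: str) -> str:
    """Join one image placeholder per frame with newlines, then append the question."""
    if num_frames < 1:
        raise ValueError("at least one frame is required")
    placeholders = "\n".join([IMAGE_PLACEHOLDER] * num_frames)
    return f"{placeholders}\n{question}"


if __name__ == "__main__":
    print(build_video_prompt(3, "Why is this video funny"))
```

The number of placeholders should match the number of image tensors passed to `model.generate`, since each placeholder stands in for the visual tokens of one frame.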
## Citation

```bibtex
@misc{wang2024LaViA,
  title={LaViA: Fine-Tuning Multimodal LLMs as Task Assistants with Video Instructions},
  url={https://github.com/Victorwz/LaViA},
  author={Wang, Weizhi and Luo, Xuan and Yan, Xifeng},
  year={2024},
}
```