	---
	license: cc
	datasets:
	- liuhaotian/LLaVA-Instruct-150K
	- liuhaotian/LLaVA-Pretrain
	language:
	- en
	---

	# Model Card for LaViA-Llama-3-8b

	Please follow my GitHub repo [LaViA](https://github.com/Victorwz/LaViA) for more details on fine-tuning the LaViA model with Llama-3 as the foundation LLM.

	## Model Details
	- Video Frame Sampling: Because we adopt CLIP-ViT-L-336px as the image encoder (576 tokens per image) and the context window of Llama-3 is 8k tokens, only a limited number of frames fit into one prompt. The video frame sampling rate is therefore set to max(30, num_frames//10).
	- Template: We follow the LLaVA-v1 template for constructing the conversation.
	- Architecture: LLaVA architecture (visual encoder + MLP adapter + LLM backbone).
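
The frame budget implied by these numbers can be checked with a little arithmetic. The sketch below is illustrative only: it reads "8k" as 8,192 tokens and reserves a hypothetical 1,024 tokens for the text prompt and the generated answer. Neither value, nor the helper names, comes from the LaViA code.

```python
# Illustrative token budget for video frames (assumed values, not from the LaViA repo).
TOKENS_PER_FRAME = 576   # CLIP-ViT-L-336px visual tokens per image, as stated above
CONTEXT_WINDOW = 8192    # assumption: "8k" context window means 8,192 tokens


def max_frames(context_window=CONTEXT_WINDOW,
               tokens_per_frame=TOKENS_PER_FRAME,
               reserved_text_tokens=1024):
    """Largest number of frames whose visual tokens fit next to the reserved text budget."""
    available = context_window - reserved_text_tokens
    return max(available // tokens_per_frame, 0)


def visual_tokens(num_frames, tokens_per_frame=TOKENS_PER_FRAME):
    """Total visual tokens consumed by num_frames sampled frames."""
    return num_frames * tokens_per_frame


print(max_frames())                        # 12
print(max_frames(reserved_text_tokens=0))  # 14
print(visual_tokens(10))                   # 5760
```

Under these assumptions, roughly 12 frames fit alongside the text, and sampling about 10 frames uses 5,760 of the available tokens.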

	## How to Use

	First, install LaViA:
	```
	git clone https://github.com/Victorwz/LaViA
	cd LaViA
	pip install -e ./
	```

	You can load the model and perform inference as follows:
	```python
	from llava.conversation import conv_templates, SeparatorStyle
	from llava.model.builder import load_pretrained_model
	from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
	from PIL import Image
	import requests
	import cv2
	import torch
	import base64
	import io
	from io import BytesIO
	import numpy as np

	# load model and processor
	device = "cuda" if torch.cuda.is_available() else "cpu"
	model_path = "weizhiwang/LaViA-Llama-3-8b"
	model_name = get_model_name_from_path(model_path)
	tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name, False, False, device=device)

	# prepare video input
	url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"

	def read_video(video_url):
	    """Download a video and return all of its frames as base64-encoded JPEG strings."""
	    response = requests.get(video_url, stream=True)
	    if response.status_code != 200:
	        raise RuntimeError(f"Failed to download video (HTTP {response.status_code})")
	    with open("tmp_video.mp4", "wb") as f:
	        for chunk in response.iter_content(chunk_size=1024):
	            f.write(chunk)

	    video = cv2.VideoCapture("tmp_video.mp4")

	    base64Frames = []
	    while video.isOpened():
	        success, frame = video.read()
	        if not success:
	            break
	        _, buffer = cv2.imencode(".jpg", frame)
	        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

	    video.release()
	    print(len(base64Frames), "frames read.")
	    return base64Frames

	video_frames = read_video(video_url=url)
	image_tensors = []
	# sample roughly 10 evenly spaced frames; keep the interval at least 1 for very short videos
	sampling_interval = max(1, len(video_frames) // 10)
	for i in range(0, len(video_frames), sampling_interval):
	    rawbytes = base64.b64decode(video_frames[i])
	    image = Image.open(io.BytesIO(rawbytes)).convert("RGB")
	    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0].half().to(device)
	    image_tensors.append(image_tensor)

	# prepare inputs for the model
	text = "\n".join(['<image>' for i in range(len(image_tensors))]) + '\n' + "Why is this video funny"
	conv = conv_templates["llama_3"].copy()
	conv.append_message(conv.roles[0], text)
	conv.append_message(conv.roles[1], None)
	prompt = conv.get_prompt()
	# -200 is the placeholder token index that LLaVA uses for each <image> tag
	input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)

	# autoregressively generate text
	with torch.inference_mode():
	    output_ids = model.generate(
	        input_ids,
	        images=image_tensors,
	        do_sample=False,
	        max_new_tokens=512,
	        use_cache=True)

	outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
	print(outputs[0])
	```
	The generated response looks like:
	```
	The video is funny because it shows a baby girl wearing glasses and reading a book, which is an unusual and amusing sight. It is not common to see a baby wearing glasses and engaging in a reading activity, as they are still developing their motor skills and cognitive abilities. The image captures a cute and endearing moment, as the baby appears to be enjoying her time and learning to read. This scene can evoke a sense of warmth and delight in the viewer, as it showcases the innocence and curiosity of childhood.
	```
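
The frame-selection and prompt-building steps of the inference snippet can be isolated into two small pure-Python helpers, which makes them easy to check without a GPU or a downloaded video. This is a minimal sketch: `sample_frame_indices` and `build_video_prompt` are hypothetical names, not part of the LaViA package, and the default target of 10 frames mirrors the interval used in the snippet.

```python
from typing import List


def sample_frame_indices(num_frames: int, target_frames: int = 10) -> List[int]:
    """Pick frame indices at a fixed interval of num_frames // target_frames (at least 1)."""
    if num_frames <= 0:
        return []
    interval = max(num_frames // target_frames, 1)
    return list(range(0, num_frames, interval))


def build_video_prompt(num_images: int, question: str) -> str:
    """Put one <image> placeholder per sampled frame, each on its own line, before the question."""
    return "\n".join(["<image>"] * num_images) + "\n" + question


indices = sample_frame_indices(100)
print(indices)  # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
print(build_video_prompt(2, "Why is this video funny"))
```

Fixed-interval sampling does not guarantee exactly 10 frames: a 105-frame video yields 11 indices, and a video with fewer than 20 frames keeps every frame. At 576 tokens per frame, it may be worth capping the number of selected frames before building the prompt.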

	## Citation

	```bibtex
	@misc{wang2024LaViA,
	title={LaViA: Fine-Tuning Multimodal LLMs as Task Assistants with Video Instructions},
	url={https://github.com/Victorwz/LaViA},
	author={Wang, Weizhi and Luo, Xuan and Yan, Xifeng},
	year={2024},
	}
	```