---
library_name: transformers
tags:
- multi-modal
- large-language-model
- video-language-model
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
- allenai/pixmo-docs
- HuggingFaceM4/Docmatix
- lmms-lab/LLaVA-Video-178K
- ShareGPT4Video/ShareGPT4Video
language:
- en
metrics:
- accuracy
pipeline_tag: any-to-any
base_model:
- Qwen/Qwen2.5-7B-Instruct
- DAMO-NLP-SG/VideoLLaMA3-7B-Image
---

# VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding

[\[🤗 HF Demo\]](https://huggingface.co./spaces/lixin4ever/VideoLLaMA2)
If you like our project, please give us a star ⭐ on GitHub for the latest updates.
## 📰 News

* **[2025.01.22]** Release models and inference code of VideoLLaMA 3.

## 🌟 Introduction

VideoLLaMA 3 is a state-of-the-art series of multimodal foundation models designed to excel in both image and video understanding tasks. Leveraging advanced architectures, VideoLLaMA 3 demonstrates exceptional capabilities in processing and interpreting visual content across various contexts. These models are specifically designed to address complex multimodal challenges, such as integrating textual and visual information, extracting insights from sequential video data, and performing high-level reasoning over both dynamic and static visual scenes.

## 🌎 Model Zoo

| Model | Base Model | HF Link |
| -------------------- | ------------ | ------------------------------------------------------------ |
| VideoLLaMA3-7B (**This Checkpoint**) | Qwen2.5-7B | [DAMO-NLP-SG/VideoLLaMA3-7B](https://huggingface.co./DAMO-NLP-SG/VideoLLaMA3-7B) |
| VideoLLaMA3-2B | Qwen2.5-1.5B | [DAMO-NLP-SG/VideoLLaMA3-2B](https://huggingface.co./DAMO-NLP-SG/VideoLLaMA3-2B) |
| VideoLLaMA3-7B-Image | Qwen2.5-7B | [DAMO-NLP-SG/VideoLLaMA3-7B-Image](https://huggingface.co./DAMO-NLP-SG/VideoLLaMA3-7B-Image) |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | [DAMO-NLP-SG/VideoLLaMA3-2B-Image](https://huggingface.co./DAMO-NLP-SG/VideoLLaMA3-2B-Image) |

We also release the tuned vision encoder of VideoLLaMA3-7B for wider application (a standalone loading sketch follows the Main Results section below):

| Model | Base Model | HF Link |
| ----------------------------- | ------------------------- | ------------------------------------------------------------ |
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | [DAMO-NLP-SG/VL3-SigLIP-NaViT](https://huggingface.co./DAMO-NLP-SG/VL3-SigLIP-NaViT) |

## 🚀 Main Results

*(Figure: main benchmark results of VideoLLaMA 3 on image and video understanding benchmarks.)*

* \* denotes the reproduced results.
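As noted in the Model Zoo above, the tuned vision encoder (VL3-SigLIP-NaViT) can also be used on its own as an image feature extractor. Below is a minimal sketch, assuming the checkpoint exposes the standard `transformers` `AutoModel` / `AutoImageProcessor` interface with `trust_remote_code=True` (the same pattern the Quick Start uses for the full model); the exact preprocessing arguments and output format may differ, so please consult the encoder's model card.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

# Hypothetical usage sketch: load the released vision encoder as a standalone
# feature extractor. The processor call signature is assumed to follow the
# standard Hugging Face image-processor interface.
encoder_name = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
image_processor = AutoImageProcessor.from_pretrained(encoder_name, trust_remote_code=True)
encoder = AutoModel.from_pretrained(
    encoder_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Encode a single image into visual features.
image = Image.open("example.jpg").convert("RGB")  # replace with your own image
inputs = image_processor(images=[image], return_tensors="pt")
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

with torch.no_grad():
    visual_features = encoder(**inputs)
print(visual_features)
```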
## 🤖 Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoModel, AutoImageProcessor

model_name = "DAMO-NLP-SG/VideoLLaMA3-7B"

# Load the model and its multimodal processor (custom code shipped with the checkpoint).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Video conversation
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "video", "data": {"video_path": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/cat_and_chicken.mp4", "fps": 1, "max_frames": 128}},
            {"type": "text", "data": "What is the cat doing?"},
        ]
    },
]

# Preprocess the conversation, move tensors to GPU, and cast pixel values to bfloat16.
inputs = processor(conversation=conversation, return_tensors="pt")
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Generate and decode the response.
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
```
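The same interface also handles single images. Below is a minimal sketch, assuming the processor's conversation schema accepts an `image` content entry that mirrors the `video` entry above; the `image_path` field name is an assumption, so check the project repository for the exact schema.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "DAMO-NLP-SG/VideoLLaMA3-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Image conversation: the "image" content type and its "image_path" field are
# assumed to mirror the "video" entry used in the video example above.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "data": {"image_path": "path/to/your/image.jpg"}},
            {"type": "text", "data": "Describe this image in detail."},
        ]
    },
]

inputs = processor(conversation=conversation, return_tensors="pt")
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
```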
## Citation

If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao},
  journal={arXiv preprint arXiv:2501.xxxxx},
  year={2025},
  url={https://arxiv.org/abs/2501.xxxxx}
}
@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url={https://arxiv.org/abs/2406.07476}
}
@article{damonlpsg2023videollama,
  title={Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author={Zhang, Hang and Li, Xin and Bing, Lidong},
  journal={arXiv preprint arXiv:2306.02858},
  year={2023},
  url={https://arxiv.org/abs/2306.02858}
}
```

## 👍 Acknowledgement

Our VideoLLaMA3 is built on top of [**SigLip**](https://huggingface.co./google/siglip-so400m-patch14-384) and [**Qwen2.5**](https://github.com/QwenLM/Qwen2.5). We also learned a lot from the implementations of [**LLaVA-OneVision**](https://github.com/LLaVA-VL/LLaVA-NeXT), [**InternVL2**](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/), and [**Qwen2VL**](https://github.com/QwenLM/Qwen2-VL). In addition, VideoLLaMA3 benefits from many open-source efforts. We sincerely appreciate them and compile a list in [ACKNOWLEDGEMENT.md](https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/ACKNOWLEDGEMENT.md) to express our gratitude. If your work is used in VideoLLaMA3 but not mentioned in either this repo or the technical report, feel free to let us know :heart:.

## 🔒 License

This project is released under the Apache 2.0 license as found in the LICENSE file. The service is a research preview intended for **non-commercial use ONLY**, subject to the model licenses of Qwen, the Terms of Use of data generated by OpenAI and Gemini, and the Privacy Practices of ShareGPT. Please get in touch with us if you find any potential violations.