---
library_name: transformers
tags:
- visual-encoder
- multi-modal-large-language-model
license: apache-2.0
language:
- en
base_model:
- google/siglip-so400m-patch14-384
pipeline_tag: image-feature-extraction
---


<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/543Eaf__U-a9Z72LPGWgC.png" width="150" style="margin-bottom: 0.2;"/>
</p>


<h3 align="center">The visual encoder of <a href="https://arxiv.org/abs/2501.13106">VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding</a></h3>

<h5 align="center"> If you like our project, please give us a star โญ on <a href="https://github.com/DAMO-NLP-SG/VideoLLaMA3">Github</a> for the latest update.  </h5>


## 🌟 Introduction
This model serves as the visual encoder in VideoLLaMA3.

VideoLLaMA3 leverages Any-resolution Vision Tokenization (AVT) to dynamically process images and videos of varying resolutions. This is accomplished by adapting the pre-trained vision encoder (based on the ViT architecture) to use 2D-RoPE (2D rotary position embeddings) in place of the absolute position embeddings traditionally used in ViT.
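
To illustrate the idea, the snippet below is a minimal, self-contained sketch of applying 2D rotary position embeddings to patch tokens based on their (row, column) positions in the patch grid. It is an illustration under our own assumptions, not the actual VideoLLaMA3 implementation (which ships with the model's remote code); the function name `rope_2d` and the frequency scheme are ours.

```python
import torch

def rope_2d(x, rows, cols, base=10000.0):
    """Illustrative 2D-RoPE: rotate half the channels by row index, half by column index.

    x:    (num_patches, dim) patch embeddings, dim divisible by 4
    rows: (num_patches,) row index of each patch in its image grid
    cols: (num_patches,) column index of each patch
    """
    dim_half = x.shape[-1] // 2
    inv_freq = 1.0 / (base ** (torch.arange(0, dim_half, 2) / dim_half))

    def rotate(v, pos):
        # Rotate consecutive channel pairs of v by angles proportional to pos.
        angles = pos[:, None].float() * inv_freq[None, :]   # (N, dim_half / 2)
        cos, sin = angles.cos(), angles.sin()
        v1, v2 = v[..., 0::2], v[..., 1::2]
        out = torch.empty_like(v)
        out[..., 0::2] = v1 * cos - v2 * sin
        out[..., 1::2] = v1 * sin + v2 * cos
        return out

    x_row, x_col = x[..., :dim_half], x[..., dim_half:]
    return torch.cat([rotate(x_row, rows), rotate(x_col, cols)], dim=-1)

# Because positions come from the patch grid itself, any resolution works
# without interpolating a fixed-size absolute position table.
h, w, dim = 6, 8, 64
grid_r, grid_c = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
patches = torch.randn(h * w, dim)
encoded = rope_2d(patches, grid_r.flatten(), grid_c.flatten())  # shape (48, 64)
```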

With AVT, VideoLLaMA3 can represent images and videos in greater detail across different resolutions, enriching the vision tokens with more information. To ensure seamless integration with AVT, we fine-tune both the vision encoder and the projector during the Vision Encoder Adaptation stage (Stage #1 of the VideoLLaMA3 training pipeline) on scene images, document data, and scene images containing text.

Before training, the model architecture and parameters are initialized from [SigLIP](https://huggingface.co./google/siglip-so400m-patch14-384).

## 🚀 Model Performance

| Base Model                           | GQA        | AI2D       | ChartQA     | DocVQA<sub>val</sub>     | MME        |
|---------------------------------|------------|------------|-------------|--------------------------|------------|
| clip-vit-large-patch14-336      |   61.50    |   56.28    |   18.32     |   24.86                  | **1668.41**|
| dfn5B-clip-vit-h-14-378         |   62.70    |   56.87    |   16.40     |   23.09                  |   1665.35  |
| siglip-so400m-patch14-384 **(Our Implementation)**       | **62.92**  | **57.12**  | **22.44**   | **31.32**                |   1667.92  |

* A more detailed analysis can be found in our [paper](https://arxiv.org/abs/2501.13106).



## 🤖 Quick Start
```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
image_path = "https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/assets/sora.png?raw=true"
images = load_image(image_path)  # load_image accepts a local path or a URL

model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; omit to use the default attention
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# merge_size=1 keeps all vision tokens; larger values spatially merge neighboring patches.
inputs = processor(images=images, merge_size=1)
# Move inputs to the GPU and cast pixel values to bfloat16 to match the model weights.
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
image_features = model(**inputs)
```
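
The exact structure of `image_features` depends on the model's remote code; the check below is simply a quick way to see what the forward pass returned (for example, a tensor of patch-level embeddings or an output object) for the chosen `merge_size`.

```python
# Not part of the official API: peek at whatever the forward pass returned.
feats = image_features[0] if isinstance(image_features, (tuple, list)) else image_features
print(type(image_features).__name__, getattr(feats, "shape", None))
```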



## Citation

If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Zhang, Boqiang and Li, Kehan and Cheng, Zesen and Hu, Zhiqiang and Yuan, Yuqian and Chen, Guanzheng and Leng, Sicong and Jiang, Yuming and Zhang, Hang and Li, Xin and Jin, Peng and Zhang, Wenqi and Wang, Fan and Bing, Lidong and Zhao, Deli},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}
```