nielsr (HF staff) committed
Commit 3fda507 · verified · 1 Parent(s): 2389db4

Update model card

This PR improves the model card by:

- updating the `pipeline_tag` to `any-to-any`
- linking to the paper page
- adding a link to the GitHub repository

Files changed (1)
  1. README.md +7 -9
README.md CHANGED
@@ -15,18 +15,17 @@ language:
 - en
 metrics:
 - accuracy
-pipeline_tag: visual-question-answering
+pipeline_tag: any-to-any
 base_model:
 - Qwen/Qwen2.5-1.5B-Instruct
 ---
 
-
 <p align="center">
   <img src="https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/tt5KYnAUmQlHtfB1-Zisl.png" width="150" style="margin-bottom: 0.2;"/>
 <p>
 
 
-<h3 align="center"><a href="https://arxiv.org/abs/2501.13106">VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding</a></h3>
+<h3 align="center"><a href="https://huggingface.co/papers/2501.13106">VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding</a></h3>
 
 <h5 align="center">
 
@@ -37,6 +36,7 @@ base_model:
 
 <h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoLLaMA3">Github</a> for the latest update. </h5>
 
+This repository contains the model described in the paper [VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding](https://huggingface.co/papers/2501.13106).
 
 ## 📰 News
 <!-- * **[2024.01.23]** 👋👋 Update technical report. If you have works closely related to VideoLLaMA3 but not mentioned in the paper, feel free to let us know.
@@ -47,9 +47,6 @@ base_model:
 VideoLLaMA 3 represents a state-of-the-art series of multimodal foundation models designed to excel in both image and video understanding tasks. Leveraging advanced architectures, VideoLLaMA 3 demonstrates exceptional capabilities in processing and interpreting visual content across various contexts. These models are specifically designed to address complex multimodal challenges, such as integrating textual and visual information, extracting insights from sequential video data, and performing high-level reasoning over both dynamic and static visual scenes.
 
 
-
-
-
 ## 🌎 Model Zoo
 | Model                | Base Model   | HF Link                                                      |
 | -------------------- | ------------ | ------------------------------------------------------------ |
@@ -109,7 +106,6 @@ response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip
 print(response)
 ```
 
-
 ## Citation
 
 If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:
@@ -119,7 +115,7 @@ If you find VideoLLaMA useful for your research and applications, please cite us
   author={Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao},
   journal={arXiv preprint arXiv:2501.xxxxx},
   year={2025},
-  url = {https://arxiv.org/abs/2501.xxxxx}
+  url = {https://arxiv.org/abs/2501.13106}
 }
 
 @article{damonlpsg2024videollama2,
@@ -137,4 +133,6 @@ If you find VideoLLaMA useful for your research and applications, please cite us
   year = {2023},
   url = {https://arxiv.org/abs/2306.02858}
 }
-```
+```
+
+Github repository: https://github.com/DAMO-NLP-SG/VideoLLaMA3
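
For context, the hunk header at old line 109 quotes the tail of the model card's inference snippet (`processor.batch_decode(...)` followed by `print(response)`). Below is a minimal sketch of how that decode step typically fits into a transformers-style generation flow. The repo id `DAMO-NLP-SG/VideoLLaMA3-2B`, the conversation format, and the `processor(conversation=...)` call are assumptions rather than content of this diff; VideoLLaMA 3 ships custom remote code, so the README's own quick-start is the authoritative version.

```python
# Minimal sketch, not taken from this diff: how the decode fragment quoted in the
# hunk context (processor.batch_decode(...)[0].strip() + print(response)) typically
# fits into a transformers-style generation flow. The repo id, conversation format,
# and processor call signature are assumptions; VideoLLaMA 3 uses custom remote
# code, so the README's own quick-start is authoritative.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "DAMO-NLP-SG/VideoLLaMA3-2B"  # assumed repo id for this checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,   # custom modeling/processing code lives in the repo
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Assumed multimodal conversation format accepted by the remote-code processor.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": {"video_path": "video.mp4", "fps": 1, "max_frames": 128}},
            {"type": "text", "text": "Describe the video."},
        ],
    },
]

inputs = processor(conversation=conversation, return_tensors="pt")
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

output_ids = model.generate(**inputs, max_new_tokens=128)

# These last two lines mirror the fragment visible in the hunk header above.
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
```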