Update model card
This PR improves the model card by:
- updating the `pipeline_tag` to `any-to-any` (see the front-matter sketch below)
- linking to the paper page
- adding a link to the Github repository
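
For reference, a minimal sketch of the resulting front matter, reconstructed from the first hunk of the diff below; only the fields visible in that hunk are shown, so the actual metadata block may contain additional entries:

```yaml
language:
- en
metrics:
- accuracy
pipeline_tag: any-to-any
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
```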
README.md CHANGED
@@ -15,18 +15,17 @@ language:
 - en
 metrics:
 - accuracy
-pipeline_tag:
+pipeline_tag: any-to-any
 base_model:
 - Qwen/Qwen2.5-1.5B-Instruct
 ---
 
-
 <p align="center">
     <img src="https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/tt5KYnAUmQlHtfB1-Zisl.png" width="150" style="margin-bottom: 0.2;"/>
 <p>
 
 
-<h3 align="center"><a href="https://
+<h3 align="center"><a href="https://huggingface.co/papers/2501.13106">VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding</a></h3>
 
 <h5 align="center">
 
@@ -37,6 +36,7 @@ base_model:
 
 <h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoLLaMA3">Github</a> for the latest update. </h5>
 
+This repository contains the model described in the paper [VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding](https://huggingface.co/papers/2501.13106).
 
 ## 📰 News
 <!-- * **[2024.01.23]** 👋👋 Update technical report. If you have works closely related to VideoLLaMA3 but not mentioned in the paper, feel free to let us know.
@@ -47,9 +47,6 @@ base_model:
 VideoLLaMA 3 represents a state-of-the-art series of multimodal foundation models designed to excel in both image and video understanding tasks. Leveraging advanced architectures, VideoLLaMA 3 demonstrates exceptional capabilities in processing and interpreting visual content across various contexts. These models are specifically designed to address complex multimodal challenges, such as integrating textual and visual information, extracting insights from sequential video data, and performing high-level reasoning over both dynamic and static visual scenes.
 
 
-
-
-
 ## Model Zoo
 | Model                | Base Model   | HF Link                                                      |
 | -------------------- | ------------ | ------------------------------------------------------------ |
@@ -109,7 +106,6 @@ response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip
 print(response)
 ```
 
-
 ## Citation
 
 If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:
@@ -119,7 +115,7 @@ If you find VideoLLaMA useful for your research and applications, please cite us
   author={Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao},
   journal={arXiv preprint arXiv:2501.xxxxx},
   year={2025},
-  url = {https://arxiv.org/abs/2501.
+  url = {https://arxiv.org/abs/2501.13106}
 }
 
 @article{damonlpsg2024videollama2,
@@ -137,4 +133,6 @@ If you find VideoLLaMA useful for your research and applications, please cite us
   year = {2023},
   url = {https://arxiv.org/abs/2306.02858}
 }
-```
+```
+
+Github repository: https://github.com/DAMO-NLP-SG/VideoLLaMA3