|
--- |
|
language: |
|
- en |
|
library_name: transformers |
|
license: apache-2.0 |
|
metrics: |
|
- accuracy |
|
tags: |
|
- multimodal |
|
pipeline_tag: video-text-to-text |
|
--- |
|
|
|
# πInternVL_2_5_HiCo_R16 β‘ |
|
<!-- [\[π° Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) --> |
|
[\[π GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5) |
|
[\[π Tech Report\]](https://arxiv.org/abs/2501.12386) |
|
<!-- [\[π¨οΈ Chat Demo\]](https://huggingface.co./spaces/OpenGVLab/VideoChat-Flash) --> |
|
|
|
|
|
## π Performance |
|
| Model | MVBench | LongVideoBench | VideoMME(w/o sub)| |
|
| --- | --- | --- | --- | |
|
|InternVL_2_5_HiCo_R16| - | - | - | |
|
|
|
## π How to use the model |
|
|
|
First, you need to install [flash attention2](https://github.com/Dao-AILab/flash-attention) and some other modules. We provide a simple installation example below: |
|
``` |
|
pip install transformers==4.40.1 |
|
pip install av |
|
pip install imageio |
|
pip install decord |
|
pip install opencv-python |
|
pip install flash-attn --no-build-isolation |
|
``` |
|
Then you could use our model: |
|
```python |
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
# model setting |
|
model_path = 'OpenGVLab/InternVL_2_5_HiCo_R16' |
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) |
|
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda() |
|
image_processor = model.get_vision_tower().image_processor |
|
|
|
|
|
# evaluation setting |
|
max_num_frames = 512 |
|
generation_config = dict( |
|
do_sample=False, |
|
temperature=0.0, |
|
max_new_tokens=1024, |
|
top_p=0.1, |
|
num_beams=1 |
|
) |
|
|
|
video_path = "your_video.mp4" |
|
|
|
# single-turn conversation |
|
question1 = "Describe this video in detail." |
|
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config) |
|
|
|
print(output1) |
|
|
|
# multi-turn conversation |
|
question2 = "How many people appear in the video?" |
|
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config) |
|
|
|
print(output2) |
|
``` |
|
|
|
## βοΈ Citation |
|
|
|
```bibtex |
|
|
|
@article{wang2025internvideo, |
|
title={InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling}, |
|
author={Wang, Yi and Li, Xinhao and Yan, Ziang and He, Yinan and Yu, Jiashuo and Zeng, Xiangyu and Wang, Chenting and Ma, Changlian and Huang, Haian and Gao, Jianfei and Dou, Min and Chen, Kai and Wang, Wenhai and Qiao, Yu and Wang, Yali and Wang, Limin}, |
|
journal={arXiv preprint arXiv:2501.12386}, |
|
year={2025} |
|
} |
|
``` |