---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
---

# 📕 InternVL_2_5_HiCo_R16 ⚡

<!-- [\[📰 Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) -->
[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5)
[\[📜 Tech Report\]](https://arxiv.org/abs/2501.12386)
<!-- [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) -->

## 📈 Performance

| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
| --- | --- | --- | --- |
| InternVL_2_5_HiCo_R16 | - | - | - |

## 🚀 How to use the model

First, install [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) and a few other required packages. A simple installation example is shown below:
```shell
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
pip install flash-attn --no-build-isolation
```
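
After installing, a quick sanity check helps catch a broken `flash-attn` build before loading the model. This is just a sketch (the version prints are informational; only `transformers==4.40.1` is pinned above), and it assumes a CUDA-capable GPU, since the model below is loaded with `.cuda()`:

```python
# Optional sanity check: all dependencies import and a GPU is visible.
import torch
import transformers, av, imageio, decord, cv2, flash_attn

print("transformers:", transformers.__version__)  # expected: 4.40.1
print("flash-attn:", flash_attn.__version__)
print("decord:", decord.__version__)
print("CUDA available:", torch.cuda.is_available())
```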
Then you can use our model:
```python
from transformers import AutoModel, AutoTokenizer

# model setting
model_path = 'OpenGVLab/InternVL_2_5_HiCo_R16'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
image_processor = model.get_vision_tower().image_processor

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1,
)

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question1,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config,
)
print(output1)

# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question2,
    chat_history=chat_history,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config,
)
print(output2)
```
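
The example above keeps the whole model in float16 on a single GPU via `.half().cuda()`. As a variation (a sketch under the assumption that the checkpoint's remote code works with the standard `transformers` loading options; `device_map="auto"` additionally requires the `accelerate` package), you can load in bfloat16 and let `transformers` place the weights:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'OpenGVLab/InternVL_2_5_HiCo_R16'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# torch_dtype and device_map are standard from_pretrained arguments;
# device_map="auto" spreads the weights over the available GPUs.
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
```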

## ✏️ Citation

```bibtex
@article{wang2025internvideo,
  title={InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling},
  author={Wang, Yi and Li, Xinhao and Yan, Ziang and He, Yinan and Yu, Jiashuo and Zeng, Xiangyu and Wang, Chenting and Ma, Changlian and Huang, Haian and Gao, Jianfei and Dou, Min and Chen, Kai and Wang, Wenhai and Qiao, Yu and Wang, Yali and Wang, Limin},
  journal={arXiv preprint arXiv:2501.12386},
  year={2025}
}
```