Update README.md
README.md (CHANGED)
@@ -5,12 +5,16 @@ license_link: LICENSE
---
<!-- ## **HunyuanVideo** -->

<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo-I2V/refs/heads/main/assets/logo.png" height=100>
</p>

# **HunyuanVideo-I2V**

Following the successful open-sourcing of our [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we proudly present [HunyuanVideo-I2V](https://github.com/Tencent/HunyuanVideo-I2V), a new image-to-video generation framework to accelerate open-source community exploration!

This repo contains official PyTorch model definitions, pre-trained weights, and inference/sampling code. You can find more visualizations on our [project page](https://aivideo.hunyuan.tencent.com). Meanwhile, we have released the LoRA training code for customizable special effects, which can be used to create more interesting video effects.
@@ -20,15 +24,48 @@ This repo contains offical PyTorch model definitions, pre-trained weights and in

## 🔥🔥🔥 News!!

* Mar 06, 2025: 👋 We release the inference code and model weights of HunyuanVideo-I2V. [Download](https://github.com/Tencent/HunyuanVideo-I2V/blob/main/ckpts/README.md).

## 📑 Open-source Plan

- HunyuanVideo-I2V (Image-to-Video Model)
-   - [x] Lora training scripts
  - [x] Inference
  - [x] Checkpoints
  - [x] ComfyUI
  - [ ] Multi-GPU Sequence Parallel inference (faster inference speed on more GPUs)
  - [ ] Diffusers
  - [ ] FP8 Quantized weights
@@ -44,20 +81,15 @@ This repo contains offical PyTorch model definitions, pre-trained weights and in

- [Installation Guide for Linux](#installation-guide-for-linux)
- [🧱 Download Pretrained Models](#-download-pretrained-models)
- [🔑 Single-gpu Inference](#-single-gpu-inference)
  - [Using Command Line](#using-command-line)
  - [More Configurations](#more-configurations)
- - [🎉 Customizable I2V LoRA effects training](#-customizable-i2v-lora-effects-training)
-   - [Requirements](#requirements)
-   - [Environment](#environment)
-   - [Training data construction](#training-data-construction)
-   - [Training](#training)
-   - [Inference](#inference)
- [🔗 BibTeX](#-bibtex)
- [Acknowledgements](#acknowledgements)
---

## **HunyuanVideo-I2V Overall Architecture**

- Leveraging the advanced video generation capabilities of [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we have extended its application to image-to-video generation tasks. To achieve this, we employ

Since we utilize a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only architecture as the text encoder, we can significantly enhance the model's ability to comprehend the semantic content of the input image and to seamlessly integrate information from both the image and its associated caption. Specifically, the input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data.
@@ -135,7 +167,7 @@ docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyua

## 🧱 Download Pretrained Models

- The details of download pretrained models are shown [here](
@@ -143,6 +175,17 @@ The details of download pretrained models are shown [here](https://github.com/Te

Similar to [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), HunyuanVideo-I2V supports high-resolution video generation, with resolution up to 720P and video length up to 129 frames (5 seconds).

### Using Command Line

<!-- ### Run a Gradio Server
@@ -152,44 +195,68 @@ python3 gradio_server.py --flow-reverse
# set SERVER_NAME and SERVER_PORT manually
# SERVER_NAME=0.0.0.0 SERVER_PORT=8081 python3 gradio_server.py --flow-reverse
``` -->

```bash
cd HunyuanVideo-I2V

python3 sample_image2video.py \
    --model HYVideo-T/2 \
    --prompt "A man with short gray hair plays a red electric guitar." \
    --i2v-mode \
    --i2v-image-path ./assets/demo/i2v/imgs/0.png \
    --i2v-resolution 720p \
    --video-length 129 \
    --infer-steps 50 \
    --flow-reverse \
    --flow-shift 17.0 \
    --seed 0 \
    --use-cpu-offload \
    --save-path ./results
```

### More Configurations

We list some more useful configurations for easy usage:

- | Argument |
- | `--prompt` | None
- | `--model` |
- | `--i2v-mode` | False
- | `--i2v-image-path` | ./assets/demo/i2v/imgs/0.
- | `--i2v-resolution` |
- (remaining rows of the previous "More Configurations" table, truncated in this view)

- ## 🎉 Customizable I2V LoRA effects training

### Requirements
@@ -216,7 +283,7 @@ Prompt description: The trigger word is written directly in the video caption. I

For example, for an AI hair growth effect, the trigger is: `rapid_hair_growth, The hair of the characters in the video is growing rapidly.` + original prompt

- After having the training video and prompt pair, refer to [here](

### Training

@@ -259,7 +326,7 @@ We list some lora specific configurations for easy usage:
|:-------------------:|:-------:|:----------------------------:|
| `--use-lora` | False | Whether to enable LoRA mode. |
| `--lora-scale` | 1.0 | Fusion scale for the LoRA model. |
- | `--lora-path` | "" | Weight path for the LoRA model. |

## 🔗 BibTeX
---
<!-- ## **HunyuanVideo** -->
+ [Read in Chinese](./README_zh.md)

<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo-I2V/refs/heads/main/assets/logo.png" height=100>
</p>

# **HunyuanVideo-I2V**

+ -----

Following the successful open-sourcing of our [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we proudly present [HunyuanVideo-I2V](https://github.com/Tencent/HunyuanVideo-I2V), a new image-to-video generation framework to accelerate open-source community exploration!

This repo contains official PyTorch model definitions, pre-trained weights, and inference/sampling code. You can find more visualizations on our [project page](https://aivideo.hunyuan.tencent.com). Meanwhile, we have released the LoRA training code for customizable special effects, which can be used to create more interesting video effects.
## 🔥🔥🔥 News!!

+ * Mar 07, 2025: 🔥 We have fixed the bug in our open-source version that caused ID changes. Please try the new model weights of [HunyuanVideo-I2V](https://huggingface.co/tencent/HunyuanVideo-I2V) to ensure full visual consistency in the first frame and produce higher-quality videos.
* Mar 06, 2025: 👋 We release the inference code and model weights of HunyuanVideo-I2V. [Download](https://github.com/Tencent/HunyuanVideo-I2V/blob/main/ckpts/README.md).

+ ### First Frame Consistency Demo
+ | Reference Image | Generated Video |
+ |:----------------:|:----------------:|
+ | <img src="https://github.com/user-attachments/assets/83e7a097-ffca-40db-9c72-be01d866aa7d" width="80%"> | <video src="https://github.com/user-attachments/assets/f81d2c88-bb1a-43f8-b40f-1ccc20774563" width="100%"> </video> |
+ | <img src="https://github.com/user-attachments/assets/c385a11f-60c7-4919-b0f1-bc5e715f673c" width="80%"> | <video src="https://github.com/user-attachments/assets/0c29ede9-0481-4d40-9c67-a4b6267fdc2d" width="100%"> </video> |
+ | <img src="https://github.com/user-attachments/assets/5763f5eb-0be5-4b36-866a-5199e31c5802" width="95%"> | <video src="https://github.com/user-attachments/assets/a8da0a1b-ba7d-45a4-a901-5d213ceaf50e" width="100%"> </video> |

+ <!-- ### Customizable I2V LoRA Demo
+
+ | I2V Lora Effect | Reference Image | Generated Video |
+ |:---------------:|:--------------------------------:|:----------------:|
+ | Hair growth | <img src="./assets/demo/i2v_lora/imgs/hair_growth.png" width="40%"> | <video src="https://github.com/user-attachments/assets/06b998ae-bbde-4c1f-96cb-a25a9197d5cb" width="100%"> </video> |
+ | Embrace | <img src="./assets/demo/i2v_lora/imgs/embrace.png" width="40%"> | <video src="https://github.com/user-attachments/assets/f8c99eb1-2a43-489a-ba02-6bd50a6dd260" width="100%"> </video> |
+ <!-- | Hair growth | <img src="./assets/demo/i2v_lora/imgs/hair_growth.png" width="40%"> | <video src="https://github.com/user-attachments/assets/06b998ae-bbde-4c1f-96cb-a25a9197d5cb" width="100%" poster="./assets/demo/i2v_lora/imgs/hair_growth.png"> </video> |
+ | Embrace | <img src="./assets/demo/i2v_lora/imgs/embrace.png" width="40%"> | <video src="https://github.com/user-attachments/assets/f8c99eb1-2a43-489a-ba02-6bd50a6dd260" width="100%" poster="./assets/demo/i2v_lora/imgs/hair_growth.png"> </video> | -->

+ <!-- ## 🧩 Community Contributions -->

+ <!-- If you develop/use HunyuanVideo-I2V in your projects, welcome to let us know. -->

+ <!-- - ComfyUI-Kijai (FP8 Inference, V2V and IP2V Generation): [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper) by [Kijai](https://github.com/kijai) -->
+ <!-- - ComfyUI-Native (Native Support): [ComfyUI-HunyuanVideo](https://comfyanonymous.github.io/ComfyUI_examples/hunyuan_video/) by [ComfyUI Official](https://github.com/comfyanonymous/ComfyUI) -->

+ <!-- - FastVideo (Consistency Distilled Model and Sliding Tile Attention): [FastVideo](https://github.com/hao-ai-lab/FastVideo) and [Sliding Tile Attention](https://hao-ai-lab.github.io/blogs/sta/) by [Hao AI Lab](https://hao-ai-lab.github.io/)
+ - HunyuanVideo-gguf (GGUF Version and Quantization): [HunyuanVideo-gguf](https://huggingface.co/city96/HunyuanVideo-gguf) by [city96](https://huggingface.co/city96)
+ - Enhance-A-Video (Better Generated Video for Free): [Enhance-A-Video](https://github.com/NUS-HPC-AI-Lab/Enhance-A-Video) by [NUS-HPC-AI-Lab](https://ai.comp.nus.edu.sg/)
+ - TeaCache (Cache-based Accelerate): [TeaCache](https://github.com/LiewFeng/TeaCache) by [Feng Liu](https://github.com/LiewFeng)
+ - HunyuanVideoGP (GPU Poor version): [HunyuanVideoGP](https://github.com/deepbeepmeep/HunyuanVideoGP) by [DeepBeepMeep](https://github.com/deepbeepmeep)
+ -->

## 📑 Open-source Plan

- HunyuanVideo-I2V (Image-to-Video Model)
  - [x] Inference
  - [x] Checkpoints
  - [x] ComfyUI
+   - [ ] Lora training scripts
  - [ ] Multi-GPU Sequence Parallel inference (faster inference speed on more GPUs)
  - [ ] Diffusers
  - [ ] FP8 Quantized weights

- [Installation Guide for Linux](#installation-guide-for-linux)
- [🧱 Download Pretrained Models](#-download-pretrained-models)
- [🔑 Single-gpu Inference](#-single-gpu-inference)
+   - [Tips for Using Image-to-Video Models](#tips-for-using-image-to-video-models)
  - [Using Command Line](#using-command-line)
  - [More Configurations](#more-configurations)
- [🔗 BibTeX](#-bibtex)
- [Acknowledgements](#acknowledgements)
---

## **HunyuanVideo-I2V Overall Architecture**
+ Leveraging the advanced video generation capabilities of [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we have extended its application to image-to-video generation tasks. To achieve this, we employ a token replace technique to effectively reconstruct and incorporate reference image information into the video generation process.

Since we utilize a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only architecture as the text encoder, we can significantly enhance the model's ability to comprehend the semantic content of the input image and to seamlessly integrate information from both the image and its associated caption. Specifically, the input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data.
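To make the token flow described above concrete, here is a minimal PyTorch sketch. The shapes, the `ref_image_latent` stand-in, and the single generic attention layer are illustrative assumptions only, not the released HunyuanVideo-I2V implementation.

```python
# Illustrative sketch of the I2V token handling described above (toy shapes,
# generic attention layer; not the actual HunyuanVideo-I2V code).
import torch
import torch.nn as nn

B, D = 1, 128                       # batch size, hidden size (toy values)
N_img, T, HW = 32, 33, 64           # image tokens, latent frames, tokens per latent frame

# (1) Semantic image tokens produced by the MLLM text encoder from the input image.
image_tokens = torch.randn(B, N_img, D)       # stand-in for MLLM(image, caption)

# (2) Video latent tokens from the VAE / patchifier.
video_latents = torch.randn(B, T, HW, D)

# Token replace (rough picture): the first latent frame is rebuilt from the
# encoded reference image so the clip stays anchored to the input image.
ref_image_latent = torch.randn(B, 1, HW, D)   # stand-in for the VAE-encoded reference image
video_latents[:, :1] = ref_image_latent

# (3) Concatenate image tokens with flattened video tokens and attend over both.
tokens = torch.cat([image_tokens, video_latents.flatten(1, 2)], dim=1)   # (B, N_img + T*HW, D)
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out, _ = attn(tokens, tokens, tokens)         # full self-attention across image + video tokens
print(out.shape)                              # torch.Size([1, 2144, 128])
```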

## 🧱 Download Pretrained Models

+ The details of downloading the pretrained models are shown [here](ckpts/README.md).
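As an illustrative shortcut only (the authoritative steps and layout are in [ckpts/README.md](ckpts/README.md)), the weights hosted under the `tencent/HunyuanVideo-I2V` Hugging Face repo id mentioned in the news above could be fetched with `huggingface_hub`:

```python
# Illustrative download sketch; the repo id comes from the Hugging Face link above,
# and the ckpts/ target directory is an assumption -- see ckpts/README.md for the
# official layout and file list.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="tencent/HunyuanVideo-I2V", local_dir="ckpts")
```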

Similar to [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), HunyuanVideo-I2V supports high-resolution video generation, with resolution up to 720P and video length up to 129 frames (5 seconds).

+ ### Tips for Using Image-to-Video Models
+ - **Use Concise Prompts**: To effectively guide the model's generation, keep your prompts short and to the point.
+ - **Include Key Elements**: A well-structured prompt should cover the following (see the example after this list):
+   - **Main Subject**: Specify the primary focus of the video.
+   - **Action**: Describe the main movement or activity taking place.
+   - **Background (Optional)**: Set the scene for the video.
+   - **Camera Angle (Optional)**: Indicate the perspective or viewpoint.
+ - **Avoid Overly Detailed Prompts**: Lengthy or highly detailed prompts can lead to unnecessary transitions in the video output.

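For example, a prompt that follows this structure might read: "A man with short gray hair plays a red electric guitar, medium shot, on a dimly lit stage." (main subject, action, camera angle, and an optional background in one short sentence).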

+ <!-- **For image-to-video models, we recommend using concise prompts to guide the model's generation process. A good prompt should include elements such as background, main subject, action, and camera angle. Overly long or excessively detailed prompts may introduce unnecessary transitions.** -->

### Using Command Line

<!-- ### Run a Gradio Server
# set SERVER_NAME and SERVER_PORT manually
# SERVER_NAME=0.0.0.0 SERVER_PORT=8081 python3 gradio_server.py --flow-reverse
``` -->

+ If you want to generate a more **stable** video, you can set `--i2v-stability` and `--flow-shift 7.0`. Execute the command as follows:
```bash
cd HunyuanVideo-I2V

python3 sample_image2video.py \
    --prompt "An Asian man with short hair in black tactical uniform and white clothes waves a firework stick." \
    --i2v-image-path ./demo/imgs/0.jpg \
    --model HYVideo-T/2 \
    --i2v-mode \
    --i2v-resolution 720p \
    --i2v-stability \
    --infer-steps 50 \
    --video-length 129 \
    --flow-reverse \
    --flow-shift 7.0 \
    --seed 0 \
    --embedded-cfg-scale 6.0 \
    --use-cpu-offload \
    --save-path ./results
```

+ If you want to generate a more **dynamic** video, you can **unset** `--i2v-stability` and set `--flow-shift 17.0`. Execute the command as follows:
```bash
cd HunyuanVideo-I2V

python3 sample_image2video.py \
    --prompt "An Asian man with short hair in black tactical uniform and white clothes waves a firework stick." \
    --i2v-image-path ./demo/imgs/0.jpg \
    --model HYVideo-T/2 \
    --i2v-mode \
    --i2v-resolution 720p \
    --infer-steps 50 \
    --video-length 129 \
    --flow-reverse \
    --flow-shift 17.0 \
    --seed 0 \
    --embedded-cfg-scale 6.0 \
    --use-cpu-offload \
    --save-path ./results
```

### More Configurations

We list some more useful configurations for easy usage:

+ | Argument | Default | Description |
+ |:----------------------:|:----------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------:|
+ | `--prompt` | None | The text prompt for video generation. |
+ | `--model` | HYVideo-T/2-cfgdistill | Here we use HYVideo-T/2 for I2V; HYVideo-T/2-cfgdistill is used for the T2V mode. |
+ | `--i2v-mode` | False | Whether to enable i2v mode. |
+ | `--i2v-image-path` | ./assets/demo/i2v/imgs/0.jpg | The reference image for video generation. |
+ | `--i2v-resolution` | 720p | The resolution of the generated video. |
+ | `--i2v-stability` | False | Whether to use stable mode for i2v inference. |
+ | `--video-length` | 129 | The length of the generated video. |
+ | `--infer-steps` | 50 | The number of steps for sampling. |
+ | `--flow-shift` | 7.0 | Shift factor for flow-matching schedulers. We recommend 7 with `--i2v-stability` switched on for more stable video, and 17 with `--i2v-stability` switched off for more dynamic video. |
+ | `--flow-reverse` | False | If reversed, learning/sampling goes from t=1 -> t=0. |
+ | `--seed` | None | The random seed for generating the video; if None, a random seed is initialized. |
+ | `--use-cpu-offload` | False | Use CPU offloading for the model load to save more memory; necessary for high-resolution video generation. |
+ | `--save-path` | ./results | Path to save the generated video. |

+ <!-- ## 🎉 Customizable I2V LoRA effects training

### Requirements

For example, for an AI hair growth effect, the trigger is: `rapid_hair_growth, The hair of the characters in the video is growing rapidly.` + original prompt

+ After having the training video and prompt pair, refer to [here](hyvideo/hyvae_extract/README.md) for training data construction.

### Training

|:-------------------:|:-------:|:----------------------------:|
| `--use-lora` | False | Whether to enable LoRA mode. |
| `--lora-scale` | 1.0 | Fusion scale for the LoRA model. |
+ | `--lora-path` | "" | Weight path for the LoRA model. | -->

## 🔗 BibTeX