---
license: other
license_name: tencent-hunyuan-community
license_link: LICENSE
---

Read in Chinese

HunyuanVideo-I2V 🌅


Following the highly successful open-sourcing of HunyuanVideo, we proudly present HunyuanVideo-I2V, a new image-to-video generation framework to accelerate open-source community exploration!

This repo contains official PyTorch model definitions, pre-trained weights, and inference/sampling code. You can find more visualizations on our project page. Meanwhile, we have released the LoRA training code for customizable special effects, which can be used to create more interesting video effects.

HunyuanVideo: A Systematic Framework For Large Video Generative Models

🔥🔥🔥 News!!

  • Mar 07, 2025: 🔥 We have fixed the bug in our open-source version that caused ID changes. Please try the new model weights of HunyuanVideo-I2V to ensure full visual consistency in the first frame and produce higher quality videos.
  • Mar 06, 2025: 👋 We release the inference code and model weights of HunyuanVideo-I2V. Download.

First Frame Consistency Demo

Reference image and generated video comparisons are available on our project page.

πŸ“‘ Open-source Plan

  • HunyuanVideo-I2V (Image-to-Video Model)
    • Inference
    • Checkpoints
    • ComfyUI
    • LoRA training scripts
    • Multi-GPU sequence-parallel inference (faster inference speed on more GPUs)
    • Diffusers
    • FP8 quantized weights


HunyuanVideo-I2V Overall Architecture

Leveraging the advanced video generation capabilities of HunyuanVideo, we have extended its application to image-to-video generation tasks. To achieve this, we employ a token replace technique to effectively reconstruct and incorporate reference image information into the video generation process.

Since we utilize a pre-trained Multimodal Large Language Model (MLLM) with a decoder-only architecture as the text encoder, we can significantly enhance the model's ability to comprehend the semantic content of the input image and to seamlessly integrate information from both the image and its associated caption. Specifically, the input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data.

The overall architecture of our system is designed to maximize the synergy between image and text modalities, ensuring a robust and coherent generation of video content from static images. This integration not only improves the fidelity of the generated videos but also enhances the model's ability to interpret and utilize complex multimodal inputs. The overall architecture is as follows.
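To make the description above concrete, here is a minimal, toy PyTorch sketch of the two ideas (token replacement of the first latent frame and concatenation of MLLM image tokens for full attention). All names and shapes are hypothetical placeholders rather than values from the HunyuanVideo-I2V code base, and a single attention layer stands in for the transformer backbone.

```python
import torch

# Toy dimensions for illustration only; the real model is far larger.
B, T, N, D = 1, 5, 16, 64      # batch, latent frames, tokens per frame, hidden size
M = 8                          # number of semantic image tokens from the MLLM

video_latents = torch.randn(B, T * N, D)   # flattened spatio-temporal video latent tokens
image_latents = torch.randn(B, N, D)       # latent tokens of the reference image
mllm_tokens   = torch.randn(B, M, D)       # semantic image tokens produced by the MLLM

# Token replacement (one plausible reading): anchor the clip on the input image
# by substituting the first latent frame with the reference-image latents.
video_latents[:, :N] = image_latents

# Concatenate the MLLM image tokens with the video latents and run full attention
# over the combined sequence.
combined = torch.cat([mllm_tokens, video_latents], dim=1)   # (B, M + T*N, D)
attn = torch.nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
out, _ = attn(combined, combined, combined)
print(out.shape)   # torch.Size([1, 88, 64])
```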

📜 Requirements

The following table shows the requirements for running the HunyuanVideo-I2V model (batch size = 1) to generate videos:

|       Model       | Resolution | GPU Peak Memory |
|:-----------------:|:----------:|:---------------:|
| HunyuanVideo-I2V  |    720p    |      60GB       |
  • An NVIDIA GPU with CUDA support is required.
    • The model has been tested on a single 80GB GPU.
    • Minimum: 60GB of GPU memory is required for 720p generation.
    • Recommended: we recommend a GPU with 80GB of memory for better generation quality (a quick way to check your GPU's memory is sketched below).
  • Tested operating system: Linux
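As a quick pre-flight check against the numbers above, the following snippet (illustrative only) reports the total memory of the first visible GPU with PyTorch:

```python
import torch

# Report the total memory of GPU 0 and warn if it is below the 720p minimum.
if not torch.cuda.is_available():
    raise SystemExit("A CUDA-capable NVIDIA GPU is required.")

props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
print(f"{props.name}: {total_gb:.1f} GB total memory")
if total_gb < 60:
    print("Warning: below the 60GB minimum recommended for 720p generation.")
```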

πŸ› οΈ Dependencies and Installation

Begin by cloning the repository:

```bash
git clone https://github.com/tencent/HunyuanVideo-I2V
cd HunyuanVideo-I2V
```

Installation Guide for Linux

We recommend CUDA versions 12.4 or 11.8 for the manual installation.

Conda's installation instructions are available here.

```bash
# 1. Create conda environment
conda create -n HunyuanVideo-I2V python==3.11.9

# 2. Activate the environment
conda activate HunyuanVideo-I2V

# 3. Install PyTorch and other dependencies using conda
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# 4. Install pip dependencies
python -m pip install -r requirements.txt

# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/[email protected]
```

If you run into a floating point exception (core dump) on specific GPU types, you may try the following solution:

```bash
# Make sure you have installed CUDA 12.4, cuBLAS>=12.4.5.8, and cuDNN>=9.0 (or simply use our CUDA 12 docker image).
pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/
```

Additionally, HunyuanVideo-I2V provides a pre-built Docker image. Use the following commands to pull and run it.

```bash
# For CUDA 12.4 (updated to avoid float point exception)
docker pull hunyuanvideo/hunyuanvideo-i2v:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo-i2v --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo-i2v:cuda_12
```

🧱 Download Pretrained Models

The details of downloading the pretrained models are shown here.
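If you prefer a Python API over the linked instructions, the weights can typically be fetched with huggingface_hub as sketched below. The repo id tencent/HunyuanVideo-I2V and the ./ckpts target directory are assumptions; follow the official download guide if it differs.

```python
# Unofficial sketch: download the model snapshot with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tencent/HunyuanVideo-I2V",   # assumed repo id
    local_dir="./ckpts",                  # assumed target directory
)
```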

🔑 Single-GPU Inference

Similar to HunyuanVideo, HunyuanVideo-I2V supports high-resolution video generation, with resolution up to 720p and video length up to 129 frames (5 seconds).

Tips for Using Image-to-Video Models

  • Use Concise Prompts: To effectively guide the model's generation, keep your prompts short and to the point.
  • Include Key Elements: A well-structured prompt should cover the following (a small example is given after this list):
    • Main Subject: Specify the primary focus of the video.
    • Action: Describe the main movement or activity taking place.
    • Background (Optional): Set the scene for the video.
    • Camera Angle (Optional): Indicate the perspective or viewpoint.
  • Avoid Overly Detailed Prompts: Lengthy or highly detailed prompts can lead to unnecessary transitions in the video output.
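To illustrate the structure above, here is a tiny sketch that assembles a prompt from the recommended elements; the wording is our own example, not an official prompt.

```python
# Illustrative only: compose a prompt from subject, action, background, and camera angle.
subject    = "An elderly fisherman in a yellow raincoat"   # main subject
action     = "slowly rows a small wooden boat"             # main action
background = "across a misty lake at dawn"                 # optional background
camera     = "wide shot"                                   # optional camera angle

prompt = f"{subject} {action} {background}, {camera}."
print(prompt)
```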

Using Command Line

If you want to generate a more stable video, set --i2v-stability and --flow-shift 7.0. Execute the command as follows:

```bash
cd HunyuanVideo-I2V

python3 sample_image2video.py \
    --prompt "An Asian man with short hair in black tactical uniform and white clothes waves a firework stick." \
    --i2v-image-path ./demo/imgs/0.jpg \
    --model HYVideo-T/2 \
    --i2v-mode \
    --i2v-resolution 720p \
    --i2v-stability \
    --infer-steps 50 \
    --video-length 129 \
    --flow-reverse \
    --flow-shift 7.0 \
    --seed 0 \
    --embedded-cfg-scale 6.0 \
    --use-cpu-offload \
    --save-path ./results
```

If you want to generate a more dynamic video, unset --i2v-stability and set --flow-shift 17.0. Execute the command as follows:

```bash
cd HunyuanVideo-I2V

python3 sample_image2video.py \
    --prompt "An Asian man with short hair in black tactical uniform and white clothes waves a firework stick." \
    --i2v-image-path ./demo/imgs/0.jpg \
    --model HYVideo-T/2 \
    --i2v-mode \
    --i2v-resolution 720p \
    --infer-steps 50 \
    --video-length 129 \
    --flow-reverse \
    --flow-shift 17.0 \
    --seed 0 \
    --embedded-cfg-scale 6.0 \
    --use-cpu-offload \
    --save-path ./results
```

More Configurations

We list some more useful configurations for easy usage:

| Argument | Default | Description |
|:---------|:--------|:------------|
| --prompt | None | The text prompt for video generation. |
| --model | HYVideo-T/2-cfgdistill | Use HYVideo-T/2 for I2V; HYVideo-T/2-cfgdistill is used for T2V mode. |
| --i2v-mode | False | Whether to enable i2v mode. |
| --i2v-image-path | ./assets/demo/i2v/imgs/0.jpg | The reference image for video generation. |
| --i2v-resolution | 720p | The resolution of the generated video. |
| --i2v-stability | False | Whether to use stable mode for i2v inference. |
| --video-length | 129 | The length of the generated video (in frames). |
| --infer-steps | 50 | The number of sampling steps. |
| --flow-shift | 7.0 | Shift factor for flow-matching schedulers. We recommend 7 with --i2v-stability switched on for a more stable video, and 17 with --i2v-stability switched off for a more dynamic video. |
| --flow-reverse | False | If reversed, learning/sampling from t=1 -> t=0. |
| --seed | None | The random seed for generating the video; if None, a random seed is initialized. |
| --use-cpu-offload | False | Use CPU offloading for the model load to save memory; necessary for high-resolution video generation. |
| --save-path | ./results | Path to save the generated video. |
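As a usage sketch for these arguments, the following unofficial Python wrapper sweeps a few random seeds by invoking the sampling script repeatedly; the prompt and paths are placeholders reused from the examples above.

```python
import subprocess

# Unofficial helper: run the CLI once per seed with the documented arguments.
prompt = ("An Asian man with short hair in black tactical uniform "
          "and white clothes waves a firework stick.")

for seed in (0, 1, 2):
    subprocess.run(
        [
            "python3", "sample_image2video.py",
            "--model", "HYVideo-T/2",
            "--i2v-mode",
            "--i2v-image-path", "./demo/imgs/0.jpg",
            "--prompt", prompt,
            "--i2v-resolution", "720p",
            "--i2v-stability",
            "--infer-steps", "50",
            "--video-length", "129",
            "--flow-reverse",
            "--flow-shift", "7.0",
            "--embedded-cfg-scale", "6.0",
            "--use-cpu-offload",
            "--seed", str(seed),
            "--save-path", "./results",
        ],
        check=True,
    )
```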

🔗 BibTeX

If you find HunyuanVideo useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{kong2024hunyuanvideo,
      title={HunyuanVideo: A Systematic Framework For Large Video Generative Models},
      author={Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Junkun Yuan, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yanxin Long, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Dax Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, and Jie Jiang, along with Caesar Zhong},
      year={2024},
      eprint={2412.03603},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.03603},
}
```

Acknowledgements

We would like to thank the contributors to the SD3, FLUX, Llama, LLaVA, Xtuner, diffusers, and HuggingFace repositories for their open research and exploration. We also thank the Tencent Hunyuan Multimodal team for their help with the text encoder.