---
license: other
license_link: https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE
language:
- en
tags:
- cogvideox
- video-generation
- thudm
- text-to-video
inference: false
---
# CogVideoX-5B
<p style="text-align: center;">
<div align="center">
<img src=https://github.com/THUDM/CogVideo/raw/main/resources/logo.svg width="50%"/>
</div>
<p align="center">
<a href="https://huggingface.co/THUDM/CogVideoX-5b/blob/main/README_zh.md">📄 中文阅读</a> |
<a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B">🤗 Hugging Face Space</a> |
<a href="https://github.com/THUDM/CogVideo">🌐 GitHub</a> |
<a href="https://arxiv.org/pdf/2408.06072">📜 arXiv</a>
</p>
## Demo Show
## Model Introduction
CogVideoX is an open-source version of the video generation model that originated from [QingYing](https://chatglm.cn/video).
The table below lists the video generation models we currently offer, along with their basic
information.
<table style="border-collapse: collapse; width: 100%;">
<tr>
<th style="text-align: center;">Model Name</th>
<th style="text-align: center;">CogVideoX-2B</th>
<th style="text-align: center;">CogVideoX-5B (Current Repository)</th>
</tr>
<tr>
<td style="text-align: center;">Model Description</td>
<td style="text-align: center;">Entry-level model with compatibility and low cost for running and secondary development.</td>
<td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
</tr>
<tr>
<td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
<td style="text-align: center;">FP16: ~90* s</td>
<td style="text-align: center;">BF16: ~180* s</td>
</tr>
<tr>
<td style="text-align: center;">Inference Precision</td>
<td style="text-align: center;"><b>FP16*(recommended)</b>, BF16, FP32, INT8, no support for INT4</td>
<td style="text-align: center;"><b>BF16(recommended)</b>, FP16, FP32, INT8, no support for INT4</td>
</tr>
<tr>
<td style="text-align: center;">Single GPU Memory Usage<br></td>
<td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
<td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
</tr>
<tr>
<td style="text-align: center;">Multi-GPU Memory Usage</td>
<td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
<td style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
</tr>
<tr>
<td style="text-align: center;">Fine-tuning Memory Usage (per GPU)</td>
<td style="text-align: center;">47 GB (bs=1, LoRA)<br> 61 GB (bs=2, LoRA)<br> 62 GB (bs=1, SFT)</td>
<td style="text-align: center;">63 GB (bs=1, LoRA)<br> 80 GB (bs=2, LoRA)<br> 75 GB (bs=1, SFT)<br></td>
</tr>
<tr>
<td style="text-align: center;">Prompt Language</td>
<td colspan="2" style="text-align: center;">English*</td>
</tr>
<tr>
<td style="text-align: center;">Max Prompt Length</td>
<td colspan="2" style="text-align: center;">226 Tokens</td>
</tr>
<tr>
<td style="text-align: center;">Video Length</td>
<td colspan="2" style="text-align: center;">6 seconds</td>
</tr>
<tr>
<td style="text-align: center;">Frame Rate</td>
<td colspan="2" style="text-align: center;">8 frames / second </td>
</tr>
<tr>
<td style="text-align: center;">Video Resolution</td>
<td colspan="2" style="text-align: center;">720 * 480, no support for other resolutions (including fine-tuning)</td>
</tr>
<tr>
<td style="text-align: center;">Positional Encoding</td>
<td style="text-align: center;">3d_sincos_pos_embed</td>
<td style="text-align: center;">3d_rope_pos_embed<br></td>
</tr>
</table>
**Data Explanation**
+ When testing with the diffusers library, both `enable_model_cpu_offload()` and `pipe.vae.enable_tiling()`
  were enabled. This setup has only been tested on **NVIDIA A100 / H100** devices. It should generally
  work on any device with the **NVIDIA Ampere architecture** or newer. If these optimizations are disabled,
  memory usage increases significantly, with peak memory about 3 times the table value.
+ The CogVideoX-2B model was trained in `FP16` precision, so `FP16` is recommended for inference with it.
+ For multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled.
+ The INT8 model lets low-memory GPUs run inference with minimal loss of video quality, but inference is
  significantly slower.
+ The inference speed tests also used the memory optimizations mentioned above. Without them, inference is
  approximately 10% faster. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; prompts in other languages can be translated into English when they are refined with a large language model.
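
The notes above can be turned into a quick back-of-the-envelope check. The sketch below is illustrative only: `estimate_peak_gb` and `cpu_offload_allowed` are hypothetical helper names, the figures are copied from the single-GPU diffusers row of the table, and the factor of 3 is the approximate multiplier quoted above, not a measured value.

```python
# Single-GPU memory figures (GB) copied from the table above:
# diffusers backend, with CPU offload and VAE tiling enabled.
SINGLE_GPU_GB = {
    ("CogVideoX-2B", "FP16"): 12.5,
    ("CogVideoX-2B", "INT8"): 7.8,
    ("CogVideoX-5B", "BF16"): 20.7,
    ("CogVideoX-5B", "INT8"): 11.4,
}

# Approximate multiplier quoted in the notes for runs without the optimizations.
UNOPTIMIZED_FACTOR = 3


def estimate_peak_gb(model: str, precision: str, optimized: bool = True) -> float:
    """Hypothetical helper: rough peak memory in GB for one GPU."""
    base = SINGLE_GPU_GB[(model, precision)]
    if optimized:
        return base
    return round(base * UNOPTIMIZED_FACTOR, 1)


def cpu_offload_allowed(num_gpus: int) -> bool:
    """Per the notes, CPU offload has to be disabled for multi-GPU inference."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return num_gpus == 1


if __name__ == "__main__":
    for (model, precision) in SINGLE_GPU_GB:
        print(
            model,
            precision,
            estimate_peak_gb(model, precision),
            estimate_peak_gb(model, precision, optimized=False),
        )
```

These are rough planning numbers; actual usage depends on the GPU, driver, and library versions.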
**Note**
+ Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version
  models. Feel free to visit our GitHub for more information.
## Quick Start 🤗
This model can be deployed with the Hugging Face diffusers library by following the steps below.
**We recommend that you visit our [GitHub](https://github.com/THUDM/CogVideo) and check out the relevant prompt
optimizations and conversions to get a better experience.**
1. Install the required dependencies
```shell
# diffusers>=0.30.1
# transformers>=4.44.0
# accelerate>=0.33.0 (installing from source is suggested)
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
```
2. Run the code
```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

# Load the pipeline in BF16, the recommended precision for CogVideoX-5B.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

# Memory optimizations used for the diffusers figures in the table above.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

# Write the generated frames to an MP4 file at 8 frames per second.
export_to_video(video, "output.mp4", fps=8)
```
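
Before running the script above, it may help to confirm that the installed packages meet the minimum versions listed in step 1. The following is a minimal sketch using only the standard library. The helper names (`parse`, `meets_minimum`, `check_environment`) are hypothetical, the version parser is deliberately simplistic (it ignores pre-release ordering), and the `transformers` minimum is taken to be 4.44.0 on the assumption that the 4.x release line is meant.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from step 1. The transformers value is an assumption (4.x line).
MINIMUMS = {
    "diffusers": "0.30.1",
    "transformers": "4.44.0",
    "accelerate": "0.33.0",
    "imageio-ffmpeg": "0.5.1",
}


def parse(v: str) -> tuple:
    """Keep the leading integer of each dot-separated piece: '0.34.0.dev0' -> (0, 34, 0)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings after padding them to the same length."""
    a, b = parse(installed), parse(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums: dict = MINIMUMS) -> dict:
    """Map each package name to True if it is installed and recent enough."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

If any package is reported as missing or too old, rerunning the `pip install --upgrade` command from step 1 should bring it up to date.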
## Explore the Model
Welcome to our [GitHub](https://github.com/THUDM/CogVideo), where you will find:
1. More technical details and code explanations.
2. Prompt optimization and conversion.
3. Inference and fine-tuning of SAT version models, and even pre-release versions.
4. Project updates and changelog, plus more opportunities to interact.
5. The CogVideoX toolchain to help you make better use of the model.
6. INT8 model inference code.
## Model License
This model is released under the [CogVideoX LICENSE](LICENSE).
## Citation
```
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
```