	---
	license: other
	license_link: https://huggingface.co./THUDM/CogVideoX-5b/blob/main/LICENSE
	language:
	- en
	tags:
	- cogvideox
	- video-generation
	- thudm
	- text-to-video
	inference: false
	---

	# CogVideoX-5B

	<p style="text-align: center;">
	<div align="center">
	<img src=https://github.com/THUDM/CogVideo/raw/main/resources/logo.svg width="50%"/>
	</div>
	<p align="center">
	<a href="https://huggingface.co./THUDM/CogVideoX-5b/blob/main/README_zh.md">📄 中文阅读</a> |
	<a href="https://huggingface.co./spaces/THUDM/CogVideoX-5B">🤗 Hugging Face Space</a> |
	<a href="https://github.com/THUDM/CogVideo">🌐 GitHub </a> |
	<a href="https://arxiv.org/pdf/2408.06072">📜 arXiv </a>
	</p>

	## Demo Show

	## Model Introduction

	CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video).
	The table below lists the video generation models we currently offer, along with their basic information.

	<table style="border-collapse: collapse; width: 100%;">
	<tr>
	<th style="text-align: center;">Model Name</th>
	<th style="text-align: center;">CogVideoX-2B</th>
	<th style="text-align: center;">CogVideoX-5B (Current Repository)</th>
	</tr>
	<tr>
	<td style="text-align: center;">Model Description</td>
	<td style="text-align: center;">Entry-level model with compatibility and low cost for running and secondary development.</td>
	<td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
	</tr>
	<tr>
	<td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
	<td style="text-align: center;">FP16: ~90* s</td>
	<td style="text-align: center;">BF16: ~180* s</td>
	</tr>
	<tr>
	<td style="text-align: center;">Inference Precision</td>
	<td style="text-align: center;"><b>FP16*(recommended)</b>, BF16, FP32, INT8, no support for INT4</td>
	<td style="text-align: center;"><b>BF16(recommended)</b>, FP16, FP32, INT8, no support for INT4</td>
	</tr>
	<tr>
	<td style="text-align: center;">Single GPU Memory Usage<br></td>
	<td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
	<td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
	</tr>
	<tr>
	<td style="text-align: center;">Multi-GPU Memory Usage</td>
	<td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
	<td style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
	</tr>
	<tr>
	<td style="text-align: center;">Fine-tuning Memory Usage (per GPU)</td>
	<td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
	<td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
	</tr>
	<tr>
	<td style="text-align: center;">Prompt Language</td>
	<td colspan="2" style="text-align: center;">English*</td>
	</tr>
	<tr>
	<td style="text-align: center;">Max Prompt Length</td>
	<td colspan="2" style="text-align: center;">226 Tokens</td>
	</tr>
	<tr>
	<td style="text-align: center;">Video Length</td>
	<td colspan="2" style="text-align: center;">6 seconds</td>
	</tr>
	<tr>
	<td style="text-align: center;">Frame Rate</td>
	<td colspan="2" style="text-align: center;">8 frames / second </td>
	</tr>
	<tr>
	<td style="text-align: center;">Video Resolution</td>
	<td colspan="2" style="text-align: center;">720 * 480, no support for other resolutions (including fine-tuning)</td>
	</tr>
	<tr>
	<td style="text-align: center;">Positional Encoding</td>
	<td style="text-align: center;">3d_sincos_pos_embed</td>
	<td style="text-align: center;">3d_rope_pos_embed<br></td>
	</tr>
	</table>
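
The Video Length and Frame Rate rows together determine how many frames a clip contains. The following is a minimal sketch of that arithmetic. It assumes one extra initial frame on top of seconds × fps, which is an inference from the Quick Start example (it passes `num_frames=49` for a 6-second clip at 8 frames per second) rather than something stated in this README. The function names are hypothetical.

```python
def frames_for_clip(seconds: int, fps: int) -> int:
    """Frame count for a clip: seconds * fps, plus one initial frame (assumed)."""
    return seconds * fps + 1


def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Inverse of frames_for_clip: playback duration of num_frames at fps."""
    return (num_frames - 1) / fps


# Values from the table: 6 seconds at 8 frames per second.
print(frames_for_clip(seconds=6, fps=8))           # 49
print(clip_duration_seconds(num_frames=49, fps=8))  # 6.0
```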

	Data Explanation

	+ When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()`
	optimization were enabled. This setup has only been tested on NVIDIA A100 / H100 devices. It should generally
	work on all devices with the NVIDIA Ampere architecture or newer. If these optimizations are disabled,
	memory usage increases significantly, with peak memory about 3 times the table value.
	+ The CogVideoX-2B model was trained using `FP16` precision, so it is recommended to use `FP16` for inference.
	+ For multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
	+ Using the INT8 model significantly reduces inference speed. This trade-off allows low-memory GPUs to run inference
	with minimal loss of video quality.
	+ Inference speed tests also used the memory optimization mentioned above. Without memory optimization, inference speed
	increases by approximately 10%. Only the `diffusers` version of the model supports quantization.
	+ The model only supports English input; prompts in other languages can be translated into English when they are refined by a large model.
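
As a rough illustration of the memory notes above, the sketch below estimates peak single-GPU memory from the table's diffusers figures and the "about 3 times" factor for running without the optimizations. The helper names are hypothetical, the factor is approximate, and applying it to the INT8 rows is an assumption. Real usage depends on the device and library versions.

```python
from typing import Dict, Tuple

# Single-GPU figures (GB) copied from the table, measured with diffusers
# and both memory optimizations enabled.
OPTIMIZED_MEMORY_GB: Dict[Tuple[str, str], float] = {
    ("CogVideoX-2B", "FP16"): 12.5,
    ("CogVideoX-2B", "INT8"): 7.8,
    ("CogVideoX-5B", "BF16"): 20.7,
    ("CogVideoX-5B", "INT8"): 11.4,
}

# "About 3 times the table value" without the optimizations (approximate).
UNOPTIMIZED_FACTOR = 3.0


def estimate_peak_memory_gb(model: str, precision: str, optimized: bool = True) -> float:
    """Return the table value, or roughly 3x that value if optimizations are off."""
    table_value = OPTIMIZED_MEMORY_GB[(model, precision)]
    if optimized:
        return table_value
    return round(table_value * UNOPTIMIZED_FACTOR, 1)


def fits_on_gpu(model: str, precision: str, gpu_memory_gb: float, optimized: bool = True) -> bool:
    """Crude check that ignores any headroom needed by other processes."""
    return estimate_peak_memory_gb(model, precision, optimized) <= gpu_memory_gb


print(estimate_peak_memory_gb("CogVideoX-5B", "BF16", optimized=False))        # 62.1
print(fits_on_gpu("CogVideoX-5B", "BF16", gpu_memory_gb=24))                   # True
print(fits_on_gpu("CogVideoX-5B", "BF16", gpu_memory_gb=24, optimized=False))  # False
```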

	Note

	+ Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of the SAT version
	of the models. Feel free to visit our GitHub for more information.

	## Quick Start 🤗

	This model can be deployed with the Hugging Face diffusers library by following the steps below.

	**We recommend that you visit our [GitHub](https://github.com/THUDM/CogVideo) and check out the relevant prompt
	optimizations and conversions to get a better experience.**
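
Because prompts are limited to 226 tokens, it can be useful to check a prompt's length before generation. The sketch below is a hypothetical helper that accepts any tokenizer callable. The whitespace tokenizer is only a stand-in for illustration and does not match the model's real tokenization; with diffusers, the pipeline's own tokenizer (assumed here to be reachable as `pipe.tokenizer`) would give the actual count.

```python
from typing import Callable, List, Sequence

MAX_PROMPT_TOKENS = 226  # "Max Prompt Length" from the model table


def fits_prompt_limit(
    prompt: str,
    tokenize: Callable[[str], Sequence],
    max_tokens: int = MAX_PROMPT_TOKENS,
) -> bool:
    """Return True if the tokenized prompt has at most max_tokens tokens."""
    return len(tokenize(prompt)) <= max_tokens


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    # Assumption: a real check would use something like
    # `lambda p: pipe.tokenizer(p).input_ids` instead.
    return text.split()


short_prompt = "A panda plays a miniature guitar in a bamboo forest."
long_prompt = " ".join(["panda"] * 300)

print(fits_prompt_limit(short_prompt, whitespace_tokenize))  # True
print(fits_prompt_limit(long_prompt, whitespace_tokenize))   # False
```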

	1. Install the required dependencies

	```shell
	# diffusers>=0.30.1
	# transformers>=4.44.0
	# accelerate>=0.33.0 (installing from source is suggested)
	# imageio-ffmpeg>=0.5.1
	pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
	```

	2. Run the code

	```python
	import torch
	from diffusers import CogVideoXPipeline
	from diffusers.utils import export_to_video

	prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

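	# Load the pipeline in BF16, the recommended precision for CogVideoX-5B.
	pipe = CogVideoXPipeline.from_pretrained(
	    "THUDM/CogVideoX-5b",
	    torch_dtype=torch.bfloat16,
	)

	# Memory optimizations used for the figures in the table above.
	pipe.enable_model_cpu_offload()
	pipe.vae.enable_tiling()

	video = pipe(
	    prompt=prompt,
	    num_videos_per_prompt=1,
	    num_inference_steps=50,
	    num_frames=49,
	    guidance_scale=6,
	    generator=torch.Generator(device="cuda").manual_seed(42),  # fixed seed for reproducibility
	).frames[0]

	# Write the frames to an MP4 file at 8 frames per second.
	export_to_video(video, "output.mp4", fps=8)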
	```
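
Before running the steps above, it may help to confirm that the installed packages meet the minimum versions from step 1. The following is a minimal sketch using only the standard library. The helper names are hypothetical, pre-release suffixes such as `.dev0` are ignored in the comparison, and the `transformers` minimum is assumed to refer to the 4.x release line.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions from the install step.
MINIMUM_VERSIONS = {
    "diffusers": "0.30.1",
    "transformers": "4.44.0",  # assumption: the 4.x release line
    "accelerate": "0.33.0",
    "imageio-ffmpeg": "0.5.1",
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.30.1' or '0.34.0.dev0' into a tuple of its leading numeric parts."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    return parse_version(installed) >= parse_version(minimum)


def check_environment() -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None if it is not installed."""
    report: Dict[str, Optional[bool]] = {}
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    labels = {True: "ok", False: "too old", None: "missing"}
    for name, status in check_environment().items():
        print(f"{name}: {labels[status]}")
```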

	## Explore the Model

	Welcome to our [GitHub](https://github.com/THUDM/CogVideo), where you will find:

	1. More detailed technical documentation and code explanations.
	2. Prompt optimization and conversion.
	3. Inference and fine-tuning of SAT version models, including pre-release versions.
	4. Project update logs and more opportunities for interaction.
	5. The CogVideoX toolchain to help you make better use of the model.
	6. INT8 model inference code.

	## Model License

	This model is released under the [CogVideoX LICENSE](LICENSE).

	## Citation

	```
	@article{yang2024cogvideox,
	title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
	author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
	journal={arXiv preprint arXiv:2408.06072},
	year={2024}
	}
	```