	---
	license: apache-2.0
	language:
	- zh
	- en
	base_model:
	- THUDM/glm-4-9b
	pipeline_tag: text-to-image
	library_name: diffusers
	---

	# CogView4-6B

	<p style="text-align: center;">
	<div align="center">
	<img src="https://github.com/THUDM/CogView4/raw/main/resources/logo.svg" width="50%"/>
	</div>
	<p align="center">
	<a href="https://huggingface.co/spaces/THUDM-HF-SPACE/CogView4">🤗 Space</a> |
	<a href="https://github.com/THUDM/CogView4">🌐 GitHub</a> |
	<a href="https://arxiv.org/pdf/2403.05121">📜 arXiv</a>
	</p>

	![img](https://raw.githubusercontent.com/THUDM/CogView4/refs/heads/main/resources/showcase.png)

	## Inference Requirements and Model Introduction

	+ Resolution: Width and height must each be between `512px` and `2048px` and divisible by `32`, and the total number of pixels must not exceed `2^21`.
	+ Precision: BF16 / FP32 (FP16 is not supported as it will cause overflow resulting in completely black images)
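
The resolution rules above can be checked before calling the pipeline. The following is a minimal sketch, not part of the CogView4 or `diffusers` API: the helper name `resolution_problems` and its constants are hypothetical and simply encode the limits as stated in the list. The pixel cap is exposed as a parameter because the memory table in this document also lists sizes above `2^21` pixels.

```python
from typing import List

MIN_SIDE = 512        # minimum width/height, as stated above
MAX_SIDE = 2048       # maximum width/height, as stated above
MULTIPLE = 32         # width and height must be divisible by this
MAX_PIXELS = 2 ** 21  # total pixel budget, as stated above


def resolution_problems(width: int, height: int, max_pixels: int = MAX_PIXELS) -> List[str]:
    """Return a list of rule violations; an empty list means the size passes."""
    problems = []
    for name, side in (("width", width), ("height", height)):
        if not MIN_SIDE <= side <= MAX_SIDE:
            problems.append(f"{name}={side} is outside [{MIN_SIDE}, {MAX_SIDE}]")
        if side % MULTIPLE != 0:
            problems.append(f"{name}={side} is not divisible by {MULTIPLE}")
    if width * height > max_pixels:
        problems.append(f"{width}x{height} exceeds {max_pixels} pixels")
    return problems


if __name__ == "__main__":
    # Print the check result for a few candidate sizes.
    for size in [(1024, 1024), (2048, 1024), (1000, 1000), (256, 512)]:
        print(size, resolution_problems(*size) or "ok")
```

Returning a list of messages rather than a boolean makes it easy to report every violated rule at once.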

	The table below shows memory usage measured at `BF16` precision with `batchsize=4`:

	| Resolution  | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON <br> Text Encoder 4bit |
	|-------------|------------------------------|-----------------------------|----------------------------------------------------|
	| 512 * 512   | 33GB                         | 20GB                        | 13GB                                               |
	| 1280 * 720  | 35GB                         | 20GB                        | 13GB                                               |
	| 1024 * 1024 | 35GB                         | 20GB                        | 13GB                                               |
	| 1920 * 1280 | 39GB                         | 20GB                        | 14GB                                               |
	| 2048 * 2048 | 43GB                         | 21GB                        | 14GB                                               |

	## Quick Start

	First, ensure you install the `diffusers` library from source.

	```shell
	git clone https://github.com/huggingface/diffusers.git
	cd diffusers
	pip install -e .
	```

	Then, run the following code:

	```python
	import torch
	from diffusers import CogView4Pipeline

	# Load the pipeline in BF16 (FP16 is not supported)
	pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

	# Enable these options to reduce GPU memory usage
	pipe.enable_model_cpu_offload()
	pipe.vae.enable_slicing()
	pipe.vae.enable_tiling()

	prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
	image = pipe(
	    prompt=prompt,
	    guidance_scale=3.5,
	    num_images_per_prompt=1,
	    num_inference_steps=50,
	    width=1024,
	    height=1024,
	).images[0]

	image.save("cogview4.png")
	```

	## Model Performance

	We've tested on multiple benchmarks and achieved the following scores:

	### DPG-Bench

	| model        | overall | global | entity | attribute | relation | other |
	|--------------|---------|--------|--------|-----------|----------|-------|
	| sdxl         | 74.65   | 83.27  | 82.43  | 80.91     | 86.76    | 80.41 |
	| pixart-alpha | 71.11   | 74.97  | 79.32  | 78.60     | 82.57    | 76.96 |
	| sd3-medium   | 84.08   | 87.90  | 91.01  | 88.83     | 80.70    | 88.68 |
	| dall-e 3     | 83.50   | 90.97  | 89.61  | 88.39     | 90.58    | 89.83 |
	| flux.1-dev   | 83.79   | 85.80  | 86.79  | 89.98     | 90.04    | 89.90 |
	| cogview4     | 85.13   | 83.85  | 90.35  | 91.17     | 91.14    | 87.29 |


	### GenEval

	| model        | overall | single | two  | counting | colors | position | color attribution |
	|--------------|---------|--------|------|----------|--------|----------|-------------------|
	| sdxl         | 0.55    | 0.98   | 0.74 | 0.39     | 0.85   | 0.15     | 0.23              |
	| pixart-alpha | 0.48    | 0.98   | 0.50 | 0.44     | 0.80   | 0.08     | 0.07              |
	| sd3-medium   | 0.74    | 0.99   | 0.94 | 0.72     | 0.89   | 0.33     | 0.60              |
	| dall-e 3     | 0.67    | 0.96   | 0.87 | 0.47     | 0.83   | 0.43     | 0.45              |
	| flux.1-dev   | 0.66    | 0.98   | 0.79 | 0.73     | 0.77   | 0.22     | 0.45              |
	| cogview4     | 0.73    | 0.99   | 0.86 | 0.66     | 0.79   | 0.48     | 0.58              |

	### T2I-CompBench

	| model        | color  | shape  | texture | 2d-spatial | 3d-spatial | numeracy | non-spatial clip | complex 3-in-1 |
	|--------------|--------|--------|---------|------------|------------|----------|------------------|----------------|
	| sdxl         | 0.5879 | 0.4687 | 0.5299  | 0.2133     | 0.3566     | 0.4988   | 0.3119           | 0.3237         |
	| pixart-alpha | 0.6690 | 0.4927 | 0.6477  | 0.2064     | 0.3901     | 0.5058   | 0.3197           | 0.3433         |
	| sd3-medium   | 0.8132 | 0.5885 | 0.7334  | 0.3200     | 0.4084     | 0.6174   | 0.3140           | 0.3771         |
	| dall-e 3     | 0.7785 | 0.6205 | 0.7036  | 0.2865     | 0.3744     | 0.5880   | 0.3003           | 0.3773         |
	| flux.1-dev   | 0.7572 | 0.5066 | 0.6300  | 0.2700     | 0.3992     | 0.6165   | 0.3065           | 0.3628         |
	| cogview4     | 0.7786 | 0.5880 | 0.6983  | 0.3075     | 0.3708     | 0.6626   | 0.3056           | 0.3869         |


	## Chinese Text Accuracy Evaluation

	| model    | Precision | Recall | F1 Score | pick@4 |
	|----------|-----------|--------|----------|--------|
	| kolors   | 0.6094    | 0.1886 | 0.2880   | 0.1633 |
	| cogview4 | 0.6969    | 0.5532 | 0.6168   | 0.3265 |
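
The F1 Score column appears to be the harmonic mean of the Precision and Recall columns. This is an assumption checked only against the rounded values in the table, not a description of the evaluation code, which is not given here. A quick sanity check, with a hypothetical helper `f1_score`:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "kolors": (0.6094, 0.1886, 0.2880),
    "cogview4": (0.6969, 0.5532, 0.6168),
}

if __name__ == "__main__":
    for model, (p, r, f1) in reported.items():
        print(f"{model}: recomputed={f1_score(p, r):.4f} reported={f1:.4f}")
```

Because the table values are rounded to four decimals, the recomputed figures can differ from the reported ones in the last digit.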


	## Citation

	🌟 If you find our work helpful, please consider citing our paper and starring the repository.

	```
	@article{zheng2024cogview3,
	title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
	author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
	journal={arXiv preprint arXiv:2403.05121},
	year={2024}
	}
	```

	## License

	This model is released under the [Apache 2.0 License](LICENSE).