Commit 8560cb7 (verified) · linwf committed: Update README.md · Parent: b2e02da

Files changed (1): README.md (+130, −3), replacing the license-only YAML front matter (`license: openrail++`) with the full model card below.

# CascadeV | An Implementation of the Würstchen Architecture for High-Resolution Video Generation

## News

**[2024.07.17]** We release the [code](https://github.com/bytedance/CascadeV) and pretrained [weights](https://huggingface.co/ByteDance/CascadeV) of a DiT-based video VAE, which supports video reconstruction with a high compression factor (1x32x32=1024). The T2V model is still on the way.

## Introduction

CascadeV is a video generation pipeline built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. By using a highly compressed latent representation, we can generate longer videos at higher resolution.
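
To make the compression factor concrete, here is the shape arithmetic behind the 1x32x32=1024 factor mentioned above (a minimal sketch; the latent channel count is omitted since it does not affect the compression factor):

```python
# Shape arithmetic for the 1x32x32 = 1024 compression factor.
frames, height, width = 8, 1024, 1024      # an 8-frame 1024x1024 clip
t_factor, s_factor = 1, 32                 # temporal and spatial compression

latent_shape = (frames // t_factor, height // s_factor, width // s_factor)
pixels_per_latent = t_factor * s_factor * s_factor

print(latent_shape)        # (8, 32, 32)
print(pixels_per_latent)   # 1024 pixels represented by each latent position
```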

## Video VAE

Comparison of Our Cascade Approach with Other VAEs (on a Latent Space of Shape 8x32x32)

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/compare.png" />

Video Reconstruction: Original (left) vs. Reconstructed (right) | *Click to view the videos*

<table class="center">
<tr>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/1.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/1.jpg" /></a></td>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/2.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/2.jpg" /></a></td>
</tr>
<tr>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/3.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/3.jpg" /></a></td>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/4.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/4.jpg" /></a></td>
</tr>
</table>

### 1. Model Architecture

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/arch.jpg" />

#### 1.1 DiT

We use [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma) as our base model with the following modifications:

* Replace the original VAE (from [SDXL](https://arxiv.org/abs/2307.01952)) with the one from [Stable Video Diffusion](https://github.com/Stability-AI/generative-models).
* Use the semantic compressor from [StableCascade](https://github.com/Stability-AI/StableCascade) to provide the low-resolution latent input.
* Remove the text encoder and all multi-head cross-attention layers, since we are not using text conditioning.
* Replace all 2D attention layers with 3D ones. We find that 3D attention outperforms 2+1D (i.e., alternating spatial and temporal attention), especially in temporal consistency; see the sketch below.

Comparison of 2+1D Attention (left) vs. 3D Attention (right)

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/2d1d_vs_3d.gif" />
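
The difference between 2+1D and 3D attention comes down to how the video tokens are grouped before self-attention. A minimal sketch of the two groupings (the toy `attn` and all shapes are illustrative assumptions, not the repository's code):

```python
import torch
from einops import rearrange

def attn(x):
    # Toy single-head self-attention over (batch, tokens, dim).
    return torch.softmax(x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1) @ x

b, t, h, w, d = 1, 8, 16, 16, 64
x = torch.randn(b, t, h, w, d)

# 3D attention: every token attends to all t*h*w tokens jointly.
y_3d = attn(rearrange(x, "b t h w d -> b (t h w) d"))

# 2+1D attention: spatial attention within each frame,
# then temporal attention at each spatial position.
y = attn(rearrange(x, "b t h w d -> (b t) (h w) d"))
y = rearrange(y, "(b t) (h w) d -> (b h w) t d", b=b, t=t, h=h, w=w)
y = attn(y)
y_21d = rearrange(y, "(b h w) t d -> b (t h w) d", b=b, h=h, w=w)
```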

#### 1.2 Grid Attention

Using 3D attention requires far more computation than 2D/2+1D attention, especially at higher resolutions. As a compromise, we replace some 3D attention layers with alternating spatial and temporal grid attention.

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/grid.jpg" />
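
A sketch of what a spatial grid-attention step can look like: the h x w plane is split into a strided g x g grid so each token attends to g*g positions instead of h*w (the grid size, shapes, and striding scheme here are our illustrative assumptions; see the repository for the exact layer):

```python
import torch
from einops import rearrange

def attn(x):
    # Toy single-head self-attention over (batch, tokens, dim).
    return torch.softmax(x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1) @ x

b, t, h, w, d, g = 1, 8, 32, 32, 64, 8
x = torch.randn(b, t, h, w, d)

# Spatial grid attention: each group holds g*g tokens strided across the
# frame, so cost per group is (g*g)^2 instead of (h*w)^2 for full attention.
y = rearrange(x, "b t (g1 p1) (g2 p2) d -> (b t p1 p2) (g1 g2) d", g1=g, g2=g)
y = attn(y)
y = rearrange(y, "(b t p1 p2) (g1 g2) d -> b t (g1 p1) (g2 p2) d",
              b=b, t=t, p1=h // g, p2=w // g, g1=g)
# A temporal grid-attention step alternates with this one in the same spirit.
```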

### 2. Evaluation

Dataset: We perform a quantitative comparison with other baselines on [Inter4K](https://alexandrosstergiou.github.io/datasets/Inter4K/index.html), sampling the first 200 videos from Inter4K to create a video dataset with a resolution of 1024x1024 at 30 FPS.

Metrics: We use PSNR, SSIM, and LPIPS to evaluate per-frame quality (and the similarity between the original and reconstructed videos), and [VBench](https://github.com/Vchitect/VBench) to evaluate video quality independently.
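
A minimal per-frame evaluation sketch using `scikit-image` and the `lpips` package (the aggregation here is a plain per-frame mean; the paper's exact evaluation protocol may differ):

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # LPIPS expects (N, 3, H, W) tensors in [-1, 1]

def video_metrics(orig, recon):
    """orig, recon: uint8 arrays of shape (T, H, W, 3)."""
    psnr = np.mean([peak_signal_noise_ratio(o, r) for o, r in zip(orig, recon)])
    ssim = np.mean([structural_similarity(o, r, channel_axis=-1)
                    for o, r in zip(orig, recon)])
    to_tensor = lambda v: torch.from_numpy(v).permute(0, 3, 1, 2).float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_tensor(orig), to_tensor(recon)).mean().item()
    return psnr, ssim, lp
```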

#### 2.1 PSNR/SSIM/LPIPS

Diffusion-based VAEs (like StableCascade and our model) perform poorly on reconstruction metrics: they produce videos with more fine-grained detail, but less similar to the originals.

| Model/Compression Factor | PSNR↑ | SSIM↑ | LPIPS↓ |
| -- | -- | -- | -- |
| Open-Sora-Plan v1.1/4x8x8=256 | 25.7282 | 0.8000 | 0.1030 |
| EasyAnimate v3/4x8x8=256 | **28.8666** | **0.8505** | **0.0818** |
| StableCascade/1x32x32=1024 | 24.3336 | 0.6896 | 0.1395 |
| Ours/1x32x32=1024 | 23.7320 | 0.6742 | 0.1786 |

#### 2.2 VBench

Our approach achieves performance comparable to the previous VAEs in both frame-wise and temporal quality, even with a much larger compression factor.

| Model/Compression Factor | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Imaging Quality | Aesthetic Quality |
| -- | -- | -- | -- | -- | -- | -- |
| Open-Sora-Plan v1.1/4x8x8=256 | 0.9519 | 0.9618 | 0.9573 | 0.9789 | 0.6791 | 0.5450 |
| EasyAnimate v3/4x8x8=256 | 0.9578 | **0.9695** | 0.9615 | **0.9845** | 0.6735 | 0.5535 |
| StableCascade/1x32x32=1024 | 0.9490 | 0.9517 | 0.9430 | 0.9639 | **0.6811** | **0.5675** |
| Ours/1x32x32=1024 | **0.9601** | 0.9679 | **0.9626** | 0.9837 | 0.6747 | 0.5579 |

### 3. Usage

#### 3.1 Installation

We recommend using Conda:

```
conda create -n cascadev python==3.9.0
conda activate cascadev
```

Install [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma):

```
bash install.sh
```

#### 3.2 Download Pretrained Weights

```
bash pretrained/download.sh
```
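
If you prefer fetching the weights directly from the Hub, a `huggingface_hub` call should work as well (a sketch; note that `pretrained/download.sh` may place files in the specific paths the scripts expect):

```python
from huggingface_hub import snapshot_download

# Download every file in the ByteDance/CascadeV model repo.
local_dir = snapshot_download(repo_id="ByteDance/CascadeV")
print(local_dir)  # cache path holding the pretrained weights
```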

#### 3.3 Video Reconstruction

A sample script for video reconstruction with a compression factor of 32:

```
bash recon.sh
```

Results of Video Reconstruction: w/o LDM (left) vs. w/ LDM (right)

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/w_vs_wo_ldm.png" />

*It takes about 1 minute to reconstruct a video of shape 8x1024x1024 on a single NVIDIA A800.*

#### 3.4 Train VAE

* Replace `video_list` in `configs/s1024.effn-f32.py` with your own video dataset (see the note below)
* Then run:

```
bash train_vae.sh
```
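
The config is a plain Python file in the PixArt-Σ style, so the edit is a one-line change along these lines (the value shown is a placeholder; check the config for the exact format the dataloader expects):

```python
# configs/s1024.effn-f32.py (excerpt)
video_list = "data/my_videos.txt"  # placeholder: point this at your own video dataset
```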

## Acknowledgement
* [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma): The **main codebase** we built upon.
* [StableCascade](https://github.com/Stability-AI/StableCascade): The Würstchen architecture we use.
* Thanks to [Stable Video Diffusion](https://github.com/Stability-AI/generative-models) for its amazing video VAE.