VAE for high-resolution image generation with Stable Diffusion
This VAE was trained by adding only a single step of noise to the latent and then denoising it with the U-Net, so that the decoder does not become oversensitive to the latent. This reduces the tendency, when generating at high resolution, to render excessive detail in certain objects (such as plants and eyes) relative to their surroundings. The dataset consists of 19k images tagged nijijourneyv5 and published on the web; the denoising was performed with models trained on the same dataset.
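The training step described above can be sketched roughly as below. This is a minimal sketch, assuming the diffusers library, a CompVis Stable Diffusion base (model id shown is an assumption), an epsilon-predicting U-Net, and an MSE reconstruction loss; the actual training code, conditioning, and loss are not published here.

```python
# Rough sketch: fine-tune the VAE on latents that have been noised for one
# step and then denoised by a frozen U-Net. Model ids and loss are assumptions.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

device = "cuda"
base = "CompVis/stable-diffusion-v1-4"  # assumed base checkpoint
vae = AutoencoderKL.from_pretrained(base, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet").to(device)
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")
unet.requires_grad_(False)  # only the VAE is fine-tuned

def training_step(pixels, text_embeds, optimizer):
    # Encode the image and add a single step of noise to the latent.
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=device)
    noisy = scheduler.add_noise(latents, noise, t)

    # Denoise that single step with the frozen U-Net (x0 estimate from
    # the predicted noise).
    with torch.no_grad():
        pred = unet(noisy, t, encoder_hidden_states=text_embeds).sample
        alpha_bar = scheduler.alphas_cumprod.to(device)[t].view(-1, 1, 1, 1)
        denoised = (noisy - (1 - alpha_bar).sqrt() * pred) / alpha_bar.sqrt()

    # Decode the denoised latent and train the VAE to reconstruct the image.
    recon = vae.decode(denoised / vae.config.scaling_factor).sample
    loss = F.mse_loss(recon, pixels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```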
sample
training details
- base model: VAE developed by CompVis
- 19k images
- 2 epochs
- Aspect Ratio Bucketing at a base resolution of 768px
- multires noise
- lr: 1e-5
- precision: fp32
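
To use the VAE, swap it into an existing Stable Diffusion pipeline at load time. A minimal sketch with diffusers follows; the repository id and base pipeline below are placeholders, not the actual published locations.

```python
# Usage sketch: load this VAE and attach it to a Stable Diffusion pipeline.
# "your-namespace/this-vae" and the base model id are placeholders.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

vae = AutoencoderKL.from_pretrained("your-namespace/this-vae",
                                    torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")

# Generate at 768p, where the retrained decoder is intended to help.
image = pipe("a detailed illustration of a garden",
             height=768, width=768).images[0]
image.save("sample.png")
```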