VAE for high-resolution image generation with Stable Diffusion
This VAE was trained by adding only a single step of noise to the latent and then denoising it with the U-Net, so that the decoder does not become oversensitive to the latent. This reduces the tendency, when generating at high resolution, to render excessive detail in certain objects (such as plants and eyes) relative to their surroundings. The dataset consists of 19k images tagged nijijourneyv5 and published on the web; the denoising was performed with models trained on the same dataset.
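The training step described above can be sketched roughly as below. This is a minimal sketch, assuming the diffusers library, a CompVis Stable Diffusion base (model id shown is an assumption), an epsilon-predicting U-Net, and an MSE reconstruction loss; the actual training code, conditioning, and loss are not published here.

```python
# Rough sketch: fine-tune the VAE on latents that have been noised for one
# step and then denoised by a frozen U-Net. Model ids and loss are assumptions.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

device = "cuda"
base = "CompVis/stable-diffusion-v1-4"  # assumed base checkpoint
vae = AutoencoderKL.from_pretrained(base, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet").to(device)
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")
unet.requires_grad_(False)  # only the VAE is fine-tuned

def training_step(pixels, text_embeds, optimizer):
    # Encode the image and add a single step of noise to the latent.
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=device)
    noisy = scheduler.add_noise(latents, noise, t)

    # Denoise that single step with the frozen U-Net (x0 estimate from
    # the predicted noise).
    with torch.no_grad():
        pred = unet(noisy, t, encoder_hidden_states=text_embeds).sample
        alpha_bar = scheduler.alphas_cumprod.to(device)[t].view(-1, 1, 1, 1)
        denoised = (noisy - (1 - alpha_bar).sqrt() * pred) / alpha_bar.sqrt()

    # Decode the denoised latent and train the VAE to reconstruct the image.
    recon = vae.decode(denoised / vae.config.scaling_factor).sample
    loss = F.mse_loss(recon, pixels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```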
sample
training details
- base model: VAE developed by CompVis
- 19k images
- 2 epochs
- Aspect Ratio Bucketing at a base resolution of 768px
- multires noise
- lr: 1e-5
- precision: fp32
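
To use the VAE, swap it into an existing Stable Diffusion pipeline at load time. A minimal sketch with diffusers follows; the repository id and base pipeline below are placeholders, not the actual published locations.

```python
# Usage sketch: load this VAE and attach it to a Stable Diffusion pipeline.
# "your-namespace/this-vae" and the base model id are placeholders.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

vae = AutoencoderKL.from_pretrained("your-namespace/this-vae",
                                    torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")

# Generate at 768p, where the retrained decoder is intended to help.
image = pipe("a detailed illustration of a garden",
             height=768, width=768).images[0]
image.save("sample.png")
```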