Using the pretrained VAE to encode a 512x512 image to latent space gives NaN; the image has been normalized to [-1, 1]
I am trying to fine-tune the upscaler model with my own data. However, when I encode a 512x512 image into the 128x128 latent space with the pretrained VAE weights, I get NaN values in the output of shape [b, 4, 128, 128].
I have traced the VAE forward function and found that, as the data flows through the computation graph, the activations quickly become huge and overflow.
Since there is no fine-tuning script for this x4-upscaler model, I use the Stable Diffusion fine-tuning script from the link below, modified for my own dataset.
https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py
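For reference, here is a minimal sketch that reproduces the problem; the checkpoint name and the random input are assumptions on my part (my actual data is real images normalized to [-1, 1]):

```python
import torch
from diffusers import AutoencoderKL

# Load only the VAE of the x4-upscaler checkpoint in half precision
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",  # assumed checkpoint name
    subfolder="vae",
    torch_dtype=torch.float16,
).to("cuda")

# Dummy 512x512 batch standing in for my data, normalized to [-1, 1]
image = torch.rand(1, 3, 512, 512, device="cuda", dtype=torch.float16) * 2 - 1

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()

print(latents.shape)                      # torch.Size([1, 4, 128, 128])
print(torch.isnan(latents).any().item())  # True once the fp16 activations overflow
```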
Is there any solution for this error?
Thanks for your help, this works. It seems that when training the text_to_image model, the VAE works fine with --mixed_precision="fp16", but for the x4-upscaler model, simply casting the VAE to torch.float16 causes an overflow.
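For anyone hitting the same issue, this is roughly the workaround as a sketch (not the exact training-script change): keep the VAE weights in fp32 and only cast its outputs down, while the rest of the model still trains with --mixed_precision="fp16".

```python
import torch
from diffusers import AutoencoderKL

# Keep the x4-upscaler VAE in full precision; do NOT cast it to fp16
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", subfolder="vae"
).to("cuda", dtype=torch.float32)

image = torch.rand(1, 3, 512, 512, device="cuda") * 2 - 1  # stand-in input in [-1, 1]

with torch.no_grad():
    # Encode in fp32 so the huge intermediate activations do not overflow,
    # then cast the well-scaled latents to fp16 for the rest of the training step
    latents = vae.encode(image).latent_dist.sample().to(torch.float16)

print(latents.shape, torch.isnan(latents).any().item())  # torch.Size([1, 4, 128, 128]) False
```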
It seems that the x4-upscaler VAE produces intermediate activation tensors with extreme values around 1e7-1e8, far beyond the fp16 maximum of about 65504, so the encode overflows in half precision.
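One way to check this yourself (a hypothetical diagnostic, reusing the fp32 `vae` and `image` from the sketch above) is to register forward hooks on the encoder submodules and log the largest absolute activation each one produces:

```python
import torch

# Record the max absolute activation of every encoder submodule
stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            stats[name] = output.detach().abs().max().item()
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in vae.encoder.named_modules()
]

with torch.no_grad():
    vae.encode(image)  # one fp32 forward pass to populate `stats`

for handle in handles:
    handle.remove()

# Print the five largest peaks; in this VAE they land around 1e7-1e8
for name, value in sorted(stats.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name}: {value:.3e}")
```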