---
license: other
pipeline_tag: image-to-image
---

# StableSR Model Card

This model card focuses on the models associated with StableSR, available [here](https://github.com/IceClear/StableSR).

## Model Details

- **Developed by:** Jianyi Wang
- **Model type:** Diffusion-based image super-resolution model
- **License:** [S-Lab License 1.0](https://github.com/IceClear/StableSR/blob/main/LICENSE.txt)
- **Model Description:** This is the model used in [Paper](https://arxiv.org/abs/2305.07015).
- **Resources for more information:** [GitHub Repository](https://github.com/IceClear/StableSR).
- **Cite as:**

      @InProceedings{wang2023exploiting,
        author = {Wang, Jianyi and Yue, Zongsheng and Zhou, Shangchen and Chan, Kelvin CK and Loy, Chen Change},
        title = {Exploiting Diffusion Prior for Real-World Image Super-Resolution},
        booktitle = {arXiv preprint arXiv:2305.07015},
        year = {2023},
      }

# Uses

Please refer to [S-Lab License 1.0](https://github.com/IceClear/StableSR/blob/main/LICENSE.txt)

## Limitations and Bias

### Limitations

- TBD

### Bias

While our model is based on a pre-trained Stable Diffusion model, we currently do not observe obvious bias in the generated results. We conjecture the main reason is that our model does not rely on text prompts but on low-resolution images. Such a strong condition makes our model less likely to be affected.

## Training

**Training Data**

The model developer used the following datasets for training the model:

- Our diffusion model is finetuned on DF2K (DIV2K and Flickr2K) + OST datasets, available [here](https://github.com/xinntao/Real-ESRGAN/blob/master/docs/Training.md).
- We further generate 100k synthetic LR-HR pairs on DF2K_OST using the finetuned diffusion model for training the CFW module.

**Training Procedure**

StableSR is an image super-resolution model finetuned on [Stable Diffusion](https://github.com/Stability-AI/stablediffusion), further equipped with a time-aware encoder and a controllable feature wrapping (CFW) module.
- Following Stable Diffusion, images are encoded through the fixed VQGAN encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
- The latent representations are fed to the time-aware encoder as guidance.
- The loss for finetuning the diffusion model is the same as in Stable Diffusion.
- After finetuning the diffusion model, we further train the CFW module using the data generated by the finetuned diffusion model.
- During CFW training, the VQGAN model is fixed and only CFW is trainable.
- The CFW training loss is similar to that used for training a VQGAN, except that we use a fixed adversarial loss weight of 0.025 rather than a self-adjustable one.

We currently provide the following checkpoints:

- [stablesr_000117.ckpt](https://huggingface.co./Iceclear/StableSR/resolve/main/stablesr_000117.ckpt): Diffusion model finetuned on DF2K_OST dataset for 117 epochs.
- [vqgan_cfw_00011.ckpt](https://huggingface.co./Iceclear/StableSR/resolve/main/vqgan_cfw_00011.ckpt): CFW module with fixed VQGAN trained on synthetic paired data for 11 epochs.

## Evaluation Results

See [Paper](https://arxiv.org/abs/2305.07015) for details.
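
## Illustrative Sketch

As a rough illustration of two numeric details from the training procedure above, the following Python sketch computes the latent shape for a given image size (downsampling factor 8, 4 latent channels) and combines a reconstruction term with an adversarial term using the fixed weight of 0.025. The function names `latent_shape` and `cfw_generator_loss` are hypothetical and do not come from the StableSR codebase. The two-term loss form and the requirement that image sides be multiples of 8 are simplifying assumptions, so the actual implementation may differ.

```python
# Illustrative only: these helpers are hypothetical, not part of the StableSR code.

DOWNSAMPLE_FACTOR = 8    # f, the autoencoder's relative downsampling factor
LATENT_CHANNELS = 4      # number of channels in the latent representation
ADV_LOSS_WEIGHT = 0.025  # fixed adversarial loss weight used when training CFW


def latent_shape(height, width, f=DOWNSAMPLE_FACTOR, channels=LATENT_CHANNELS):
    """Map an H x W x 3 image shape to its H/f x W/f x 4 latent shape."""
    # Assumption: sides must be exact multiples of f; real pipelines may pad instead.
    if height % f or width % f:
        raise ValueError(f"height and width must be multiples of {f}")
    return (height // f, width // f, channels)


def cfw_generator_loss(reconstruction_loss, adversarial_loss, weight=ADV_LOSS_WEIGHT):
    """Add the adversarial term with a fixed weight (assumed form, other terms omitted)."""
    return reconstruction_loss + weight * adversarial_loss


if __name__ == "__main__":
    print(latent_shape(512, 512))        # latent shape for a 512 x 512 RGB image
    print(cfw_generator_loss(1.0, 2.0))  # fixed-weight combination of two loss terms
```

With a fixed weight, the adversarial term contributes a constant fraction of its value to the total, whereas a self-adjustable weight would rescale it during training.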