---
license: openrail++
tags:
- stable-diffusion
- stable-diffusion-2-1
- text-to-image
pinned: true
library_name: diffusers
---

# Model Card for pseudo-flex-base

stable-diffusion-2-1 (`stabilityai/stable-diffusion-2-1`) fine-tuned on multiple aspect ratios, starting from the photography model `ptx0/pseudo-real-beta`.

## Background

The `ptx0/pseudo-real-beta` pretrained checkpoint had its unet trained for 4,200 steps and its text encoder trained for 15,600 steps, at a batch size of 15 with 10 gradient accumulations, on a diverse dataset:

* cushman (8,000 Kodachrome slides from 1939 to 1969)
* midjourney v5.1-filtered (about 22,000 upscaled v5.1 images)
* national geographic (about 3,000-4,000 images larger than 1024x768, of animals, wildlife, landscapes, and history)
* a small dataset of stock images of people vaping / smoking

It has diverse photorealistic and adventure capabilities with strong prompt coherence. However, it lacks multi-aspect capability: the code used to train `pseudo-real-beta` did not support aspect bucketing. I discovered `pseudo-flex-base` by @ttj, which supported theories I had.

## Training code

I added thorough aspect-bucketing support to the training dataloader: any image smaller than 1024x1024 is discarded, and all remaining images are conditioned so that the smaller side of the image is 1024. The aspect ratio of the image determines the new length of the other dimension, e.g. used as a multiplier for landscape or a divisor for portrait mode.

All images in a batch share the same resolution. Different resolutions at the same aspect ratio are all conditioned to 1024x... or ...x1024. A 1920x1080 image becomes approximately 1820x1024.

## Starting checkpoint

This model, `pseudo-flex-base`, was created by fine-tuning the base `stabilityai/stable-diffusion-2-1` 768 model, with its text encoder frozen, for 1,000 steps on 148,000 images from LAION HD, using the TEXT field as the caption. The batch size was effectively 150 again: a batch size of 15 with 10 gradient accumulations.
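The bucketing rule described above (discard anything under 1024x1024, pin the smaller side to 1024, and derive the other side from the aspect ratio) can be sketched as follows. This is a minimal illustration; `bucket_resize` is a hypothetical helper name, not part of the actual training code:

```python
def bucket_resize(width: int, height: int, base: int = 1024):
    """Return the conditioned (width, height) for an image, or None if
    the image is smaller than base x base and should be discarded."""
    if width < base or height < base:
        return None  # thrown away by the dataloader
    if width >= height:
        # Landscape (or square): pin height to base, scale width accordingly.
        return (round(width * base / height), base)
    # Portrait: pin width to base, scale height accordingly.
    return (base, round(height * base / width))

# A 1920x1080 image becomes approximately 1820x1024:
print(bucket_resize(1920, 1080))  # → (1820, 1024)
```

Batches are then formed only from images that landed in the same resolution bucket.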
This is very slow at very high resolutions; an aspect ratio of 1.5-1.7 takes about 700 seconds per iteration on an A100 80G. This training took two days.

## Text encoder swap

At 1,000 steps, the text encoder from `ptx0/pseudo-real-beta` was used experimentally with this model's unet, in an attempt to resolve some residual image noise, e.g. pixelation. That worked! Training was restarted from checkpoint 1000 with this text encoder.

## The beginnings of wide / portrait aspects appearing

Validation prompts began to "pull together" from 1,300 to 2,950 steps. Some checkpoints show regression, but these usually resolve within about 100 steps. Improvements were always present, despite regressions.

## Degradation and dataset swap

After training for some time on 148,000 images at a batch size of 150 over 3,000 steps, images began to degrade. This is presumably due to having completed 3 repeats over all images in the set, and that is *if* all images in the set had been used. Considering that some of the image filters discarded about 50,000 images, we landed at roughly 9 repeats per image at our very low learning rate.

This caused three issues:

* The images were beginning to show static noise.
* Training was taking a very long time, and each checkpoint showed little improvement.
* The model was overfitting to the prompt vocabulary and losing generalization.

Ergo, at 1300 steps, the decision was made to cease training on the original LAION HD dataset and instead train on a *new*, freshly retrieved subset of high-resolution Midjourney v5.1 data. This consisted of 17,800 images at a base resolution of 1024x1024, with about 700 samples in portrait and 700 samples in landscape.

## Contrast issues

As checkpoint 3275 was tested, a common observation was that darker images were washed out and brighter images seemed "meh".
Various CFG rescale and guidance levels were tested; the best dark images occurred around `guidance_scale=9.2` and `guidance_rescale=0.0`, but they remained "washed out".

## Dataset change number two

A new LAION subset was prepared with unique images and no square images - just a limited collection of aspect ratios:

* 16:9
* 9:16
* 2:3
* 3:2

This was intended to speed up the model's learning and prevent overfitting on captions. This LAION subset contained 17,800 images, evenly distributed across the aspect ratios. The images were then captioned using T5 Flan with BLIP2, to obtain highly accurate results.

## Contrast fix: offset noise / SNR gamma to the rescue?

Offset noise and SNR gamma were applied experimentally to checkpoint **4250**:

* `snr_gamma=5.0`
* `noise_offset=0.2`
* `noise_perturbation=0.1`

Within 25 steps of training, the contrast was back, and the prompt `a solid black square` once again produced a reasonable result. At 50 steps of offset noise, things really seemed to "click", and `a solid black square` had the fewest deformities I've seen.

The step 75 checkpoint was broken: the SNR gamma math resulted in numeric instability, so it was disabled. The offset noise parameters were left untouched.

## Success! Improvement in quality and contrast.

Similar to the text encoder swap, the images showed a marked improvement over the next several checkpoints. The model was left to its own devices, and at step 4475, enough improvement was observed that another revision was created in this repository.

# Status: Test release

This model has been packaged up in a test form so that it can be thoroughly assessed by users.

For usage, see [How to Get Started with the Model](#how-to-get-started-with-the-model).

### It aims to solve the following issues:

1. Generated images look like they are cropped from a larger image.
2. Generating non-square images creates weird results, due to the model being trained on square images.

Examples: (WIP)

### Limitations:

1. It's trained on a small dataset, so its improvements may be limited.
2. The model architecture of SD 2.1 is older than SDXL, and will not generate comparably good results.

For a 1:1 aspect ratio, it is fine-tuned at 1024x1024, although the `ptx0/pseudo-real-beta` it was based on was last fine-tuned at 768x768.

### Potential improvements:

1. Train on a captioned dataset. This model used the TEXT field from LAION for convenience, though COCO-generated captions would be superior.
2. Train the text encoder on large images.
3. Enforce periodic caption drop-out to help condition classifier-free guidance capabilities.

# Table of Contents

- [Model Card for pseudo-flex-base](#model-card-for-pseudo-flex-base)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
- [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

stable-diffusion-2-1 (`stabilityai/stable-diffusion-2-1` and `ptx0/pseudo-real-beta`) fine-tuned for dynamic aspect ratios.

Fine-tuned resolutions:

|    |   width |   height | aspect ratio | images |
|---:|--------:|---------:|:-------------|-------:|
|  0 |    1024 |     1024 | 1:1          |  90561 |
|  1 |    1536 |     1024 | 3:2          |   8716 |
|  2 |    1365 |     1024 | 4:3          |   6933 |
|  3 |    1468 |     1024 | ~3:2         |    113 |
|  4 |    1778 |     1024 | ~5:3         |   6315 |
|  5 |    1200 |     1024 | ~5:4         |   6376 |
|  6 |    1333 |     1024 | ~4:3         |   2814 |
|  7 |    1281 |     1024 | ~5:4         |     52 |
|  8 |    1504 |     1024 | ~3:2         |    139 |
|  9 |    1479 |     1024 | ~3:2         |     25 |
| 10 |    1384 |     1024 | ~4:3         |   1676 |
| 11 |    1370 |     1024 | ~4:3         |     63 |
| 12 |    1499 |     1024 | ~3:2         |    436 |
| 13 |    1376 |     1024 | ~4:3         |     68 |

Other aspects were in smaller buckets. This could have been done more succinctly or carefully, but careless handling of the data was part of the experiment's parameters.

- **Developed by:** pseudoterminal
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** creativeml-openrail-m
- **Parent Model:** https://huggingface.co./ptx0/pseudo-real-beta
- **Resources for more information:** More information needed

# Uses

- See https://huggingface.co./stabilityai/stable-diffusion-2-1

# Training Details

## Training Data

- LAION HD dataset subsets - https://huggingface.co./datasets/laion/laion-high-resolution
  - We only used a small portion of that; see [Preprocessing](#preprocessing)

### Preprocessing

All pre-processing is done via the scripts in `bghira/SimpleTuner` on GitHub.

### Speeds, Sizes, Times

- Dataset size: 100k image-caption pairs, after filtering.
- Hardware: 1x A100 80G GPU
- Optimizer: 8-bit Adam
- Batch size: 150 effective
  - actual batch size: 15
  - gradient_accumulation_steps: 10
- Learning rate: constant 4e-8, adjusted by reducing the batch size over time
- Training steps: WIP (ongoing)
- Training time: approximately 4 days (so far)

## Results

More information needed

# Model Card Authors

pseudoterminal

# How to Get Started with the Model

Use the code below to get started with the model.

```python
# Use PyTorch 2!
import torch
from diffusers import DiffusionPipeline, DDPMScheduler

# Any Stable Diffusion model currently on the Hugging Face Hub.
model_id = 'ptx0/pseudo-flex-base'
pipeline = DiffusionPipeline.from_pretrained(model_id)

# Optimize!
pipeline.unet = torch.compile(pipeline.unet)

# Loaded here in case you want to swap schedulers.
scheduler = DDPMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler"
)

# Remove this if you get an error.
torch.set_float32_matmul_precision('high')

pipeline.to('cuda')
prompts = {
    "woman": "a woman, hanging out on the beach",
    "man": "a man playing guitar in a park",
    "lion": "Explore the ++majestic beauty++ of untamed ++lion prides++ as they roam the African plains --captivating expressions-- in the wildest national geographic adventure",
    "child": "a child flying a kite on a sunny day",
    "bear": "best quality ((bear)) in the swiss alps cinematic 8k highly detailed sharp focus intricate fur",
    "alien": "an alien exploring the Mars surface",
    "robot": "a robot serving coffee in a cafe",
    "knight": "a knight protecting a castle",
    "menn": "a group of smiling and happy men",
    "bicycle": "a bicycle, on a mountainside, on a sunny day",
    "cosmic": "cosmic entity, sitting in an impossible position, quantum reality, colours",
    "wizard": "a mage wizard, bearded and gray hair, blue star hat with wand and mystical haze",
    "wizarddd": "digital art, fantasy, portrait of an old wizard, detailed",
    "macro": "a dramatic city-scape at sunset or sunrise",
    "micro": "RNA and other molecular machinery of life",
    "gecko": "a leopard gecko stalking a cricket",
}

for shortname, prompt in prompts.items():
    # old prompt: ''
    image = pipeline(
        prompt=prompt,
        negative_prompt='malformed, disgusting, overexposed, washed-out',
        num_inference_steps=25,
        generator=torch.Generator(device='cuda').manual_seed(1641421826),
        width=1368,
        height=720,
        guidance_scale=7.5,
        guidance_rescale=0.3,
    ).images[0]
    image.save(f'test/{shortname}_nobetas.png', format="PNG")
```
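Offset noise and input perturbation, as applied around checkpoint 4250 in the contrast-fix section above, follow the approach used in the diffusers fine-tuning examples: a per-(batch, channel) constant is added to the sampled noise so the model can learn to shift overall brightness. A minimal sketch, using the parameter values quoted earlier (the function name is hypothetical):

```python
import torch

def sample_training_noise(latents: torch.Tensor,
                          noise_offset: float = 0.2,
                          noise_perturbation: float = 0.1):
    """Return (target_noise, input_noise) for one training step."""
    noise = torch.randn_like(latents)
    # Offset noise: one random constant per (batch, channel), broadcast
    # across H and W, which frees the model to adjust global brightness.
    noise = noise + noise_offset * torch.randn(
        latents.shape[0], latents.shape[1], 1, 1, device=latents.device
    )
    # Input perturbation: jitter the noise actually mixed into the latents,
    # while the unperturbed `noise` remains the prediction target.
    perturbed = noise + noise_perturbation * torch.randn_like(noise)
    return noise, perturbed
```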