metadata
pipeline_tag: text-to-image
license: other
license_name: stable-cascade-nc-community
license_link: LICENSE

SoteDiffusion Cascade

An anime finetune of Stable Cascade.
Training is currently at a very early stage.
No commercial use, due to the StabilityAI license.

Code Example

pip install diffusers

import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

prompt = "newest, 1girl, solo, cat ears, looking at viewer, blush, light smile,"
negative_prompt = "very displeasing, worst quality, monochrome, sketch, fat, child,"

prior = StableCascadePriorPipeline.from_pretrained("Disty0/sote-diffusion-cascade_alpha0", torch_dtype=torch.float16)
decoder = StableCascadeDecoderPipeline.from_pretrained("Disty0/sote-diffusion-cascade-decoder_alpha0", torch_dtype=torch.float16)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=7.0,
    num_images_per_prompt=1,
    num_inference_steps=40
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=1.5,
    output_type="pil",
    num_inference_steps=10
).images[0]
decoder_output.save("cascade.png")

Training Status:

Alpha0 Release: This release resets the training and enables Text Encoder training.

GPU used for training: 1x AMD RX 7900 XTX 24GB

| dataset name | training done | remaining |
|---|---|---|
| newest | 000 | 230 |
| recent | 000 | 206 |
| mid | 000 | 201 |
| early | 000 | 055 |
| oldest | 000 | 016 |
| pixiv | 000 | 074 |
| visual novel cg | 000 | 070 |
| anime wallpaper | 000 | 013 |
| Total | 8 | 865 |

Note: chunk numbering starts from 0, and each chunk contains 8000 images.

Dataset:

GPU used for captioning: 1x Intel ARC A770 16GB
Model used for captioning: SmilingWolf/wd-swinv2-tagger-v3
Command:

python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py \
--model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" \
--repo_id "SmilingWolf/wd-swinv2-tagger-v3" \
--recursive \
--remove_underscore \
--use_rating_tags \
--character_tags_first \
--character_tag_expand \
--append_tags \
--onnx \
--caption_separator ", " \
--general_threshold 0.35 \
--character_threshold 0.50 \
--batch_size 4 \
--caption_extension ".txt" \
./

| dataset name | total images | total chunks |
|---|---|---|
| newest | 1,843,053 | 231 |
| recent | 1,652,420 | 207 |
| mid | 1,609,608 | 202 |
| early | 442,368 | 056 |
| oldest | 128,311 | 017 |
| pixiv | 594,046 | 075 |
| visual novel cg | 560,903 | 071 |
| anime wallpaper | 106,882 | 014 |
| Total | 6,937,591 | 873 |

Note: The smallest image size is 1280x600 (768,000 pixels).
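
The chunk counts in the table appear to follow from the 8000-images-per-chunk note in the training status section. Below is a minimal sketch of that arithmetic, assuming each dataset is chunked separately and the last chunk may be only partially filled. The function names `chunk_count` and `last_chunk_index` are illustrative and do not come from the training scripts.

```python
import math

IMAGES_PER_CHUNK = 8000  # from the note in the training status section


def chunk_count(total_images: int, chunk_size: int = IMAGES_PER_CHUNK) -> int:
    """Number of chunks needed, counting a partially filled last chunk."""
    return math.ceil(total_images / chunk_size)


def last_chunk_index(total_images: int, chunk_size: int = IMAGES_PER_CHUNK) -> int:
    """Index of the last chunk, since chunk numbering starts from 0."""
    return chunk_count(total_images, chunk_size) - 1


print(chunk_count(1_652_420))       # "recent" dataset
print(last_chunk_index(1_652_420))  # last chunk index for "recent"
```

With the totals from the table, this gives 207 chunks for recent and 17 for oldest. The corresponding last chunk indices (206 and 16) match the remaining column of the training status table.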

Tags:

aesthetic tags, quality tags, date tags, custom tags, rating tags, character tags, rest of the tags

Date:

| tag | date |
|---|---|
| newest | 2022 to 2024 |
| recent | 2019 to 2021 |
| mid | 2015 to 2018 |
| early | 2011 to 2014 |
| oldest | 2005 to 2010 |
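
As a rough illustration of the date table above, the mapping from a post year to a date tag could be sketched as follows. This assumes the year ranges are inclusive on both ends. The helper name `date_tag` is hypothetical, and years outside 2005 to 2024 are not covered by the table, so the sketch returns `None` for them.

```python
from typing import Optional

# (first year, last year, tag), copied from the date table
DATE_RANGES = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def date_tag(year: int) -> Optional[str]:
    """Return the date tag for a year, or None if the table has no range for it."""
    for first, last, tag in DATE_RANGES:
        if first <= year <= last:  # assumption: both ends are inclusive
            return tag
    return None


print(date_tag(2023))
print(date_tag(2012))
```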

Aesthetic Tags:

Model used: shadowlilac/aesthetic-shadow-v2

| score greater than | tag |
|---|---|
| 0.90 | extremely aesthetic |
| 0.80 | very aesthetic |
| 0.70 | aesthetic |
| 0.50 | slightly aesthetic |
| 0.40 | not displeasing |
| 0.30 | not aesthetic |
| 0.20 | slightly displeasing |
| 0.10 | displeasing |
| rest of them | very displeasing |

Quality Tags:

Model used: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth

| score greater than | tag |
|---|---|
| 0.980 | best quality |
| 0.900 | high quality |
| 0.750 | great quality |
| 0.500 | medium quality |
| 0.250 | normal quality |
| 0.125 | bad quality |
| 0.025 | low quality |
| rest of them | worst quality |
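
The aesthetic and quality tables share the same "score greater than" structure, so one lookup can serve both. The sketch below assumes a strict greater-than comparison and a score already produced by the respective scoring model; it does not call either model. The function names are illustrative only.

```python
# (threshold, tag) pairs in descending order, copied from the two tables
AESTHETIC_THRESHOLDS = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]
QUALITY_THRESHOLDS = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]


def tag_for_score(score, thresholds, fallback):
    """Return the tag of the first threshold the score exceeds, else the fallback."""
    for threshold, tag in thresholds:
        if score > threshold:  # assumption: strictly greater than
            return tag
    return fallback


def aesthetic_tag(score):
    return tag_for_score(score, AESTHETIC_THRESHOLDS, "very displeasing")


def quality_tag(score):
    return tag_for_score(score, QUALITY_THRESHOLDS, "worst quality")


print(aesthetic_tag(0.85))
print(quality_tag(0.30))
```

Keeping the thresholds as data rather than as a chain of `if` statements makes it easy to compare them against the tables line by line.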

Rating Tags

  • general
  • sensitive
  • questionable
  • explicit

Custom Tags:

| dataset name | custom tag |
|---|---|
| image boards | date, |
| pixiv | art by Display_Name, |
| visual novel cg | Full_VN_Name (short_3_letter_name), visual novel cg, |
| anime wallpaper | date, anime wallpaper, |
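
One way to read the custom tag table is as a per-dataset template. The sketch below assumes that "date" refers to the date tag (newest, recent, and so on) and that the trailing commas are just the caption separator. The function `custom_tags`, its parameters, and the artist name used in the example are all hypothetical.

```python
from typing import List


def custom_tags(
    dataset: str,
    *,
    date_tag: str = "",
    display_name: str = "",
    vn_full_name: str = "",
    vn_short_name: str = "",
) -> List[str]:
    """Build the custom tags for one image, following the table's templates."""
    if dataset == "image boards":
        return [date_tag]
    if dataset == "pixiv":
        return [f"art by {display_name}"]
    if dataset == "visual novel cg":
        return [f"{vn_full_name} ({vn_short_name})", "visual novel cg"]
    if dataset == "anime wallpaper":
        return [date_tag, "anime wallpaper"]
    raise ValueError(f"unknown dataset: {dataset}")


# Join with the same ", " separator used by the captioning command
print(", ".join(custom_tags("anime wallpaper", date_tag="recent")))
print(", ".join(custom_tags("pixiv", display_name="Example Artist")))
```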

Training Params:

Software used: Kohya SD-Scripts with Stable Cascade branch
Base model: Disty0/sote-diffusion-cascade_pre-alpha0

Command:

accelerate launch --mixed_precision fp16 --num_cpu_threads_per_process 1 stable_cascade_train_stage_c.py \
--mixed_precision fp16 \
--save_precision fp16 \
--full_fp16 \
--sdpa \
--gradient_checkpointing \
--train_text_encoder \
--resolution "1024,1024" \
--train_batch_size 2 \
--adaptive_loss_weight \
--learning_rate 4e-6 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--optimizer_type adafactor \
--optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
--max_grad_norm 0 \
--token_warmup_min 1 \
--token_warmup_step 0 \
--shuffle_caption \
--caption_dropout_rate 0 \
--caption_tag_dropout_rate 0 \
--caption_dropout_every_n_epochs 0 \
--dataset_repeats 1 \
--save_state \
--save_every_n_steps 2048 \
--sample_every_n_steps 512 \
--max_token_length 225 \
--max_train_epochs 1 \
--caption_extension ".txt" \
--max_data_loader_n_workers 2 \
--persistent_data_loader_workers \
--enable_bucket \
--min_bucket_reso 256 \
--max_bucket_reso 4096 \
--bucket_reso_steps 64 \
--bucket_no_upscale \
--log_with tensorboard \
--output_name sotediffusion-sc_3b \
--train_data_dir /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0000 \
--in_json /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0000.json \
--output_dir /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-0 \
--logging_dir /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-0/logs \
--resume /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480-state \
--stage_c_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480.safetensors \
--text_model_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480_text_model.safetensors \
--effnet_checkpoint_path /mnt/DataSSD/AI/models/sd-cascade/effnet_encoder.safetensors \
--previewer_checkpoint_path /mnt/DataSSD/AI/models/sd-cascade/previewer.safetensors \
--sample_prompts /mnt/DataSSD/AI/SoteDiffusion/StableCascade/config/sotediffusion-prompt.txt

Limitations and Bias

Bias

  • This model is intended for anime illustrations.
    Realistic capabilities are not tested at all.
  • Still underbaked.

Limitations

  • Can fall back to a realistic style.
    Add the "realistic" tag to the negative prompt when this happens.
  • Eyes in far shots are still poor due to the heavy latent compression.