A few questions:

- Why does this model require more resources than the original one? https://huggingface.co./kandinsky-community/kandinsky-3
- Why do you download `kandinsky3.pt` and `movq.pt` every time instead of loading them from the HF cache directory?
- Where do you store `kandinsky3.pt` and `movq.pt`?
- How can I run inference on an RTX 4090? I was barely able to load the model after adding more RAM, but inference does not work, even when distributing it across two RTX 4090s like this:
```python
import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline

# get_T2I_pipeline comes from the original Kandinsky-3 repo
distributed_state = PartialState()
distributed_state.num_processes = 2
t2i_pipe = get_T2I_pipeline(distributed_state.device, fp16=True)
```
Inference on A100 is horribly slow...
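For comparison, the diffusers port linked above is normally loaded along these lines (a minimal sketch, assuming the Hub checkpoint ships an `fp16` variant; note that `PartialState` is meant for data-parallel inference across processes, so overriding `num_processes` by hand does not split one model over two GPUs):

```python
import torch
from diffusers import AutoPipelineForText2Image

# Sketch only: fp16 weights plus model CPU offload instead of keeping
# every submodule resident on the GPU at once.
pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3",
    variant="fp16",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()  # moves each submodule to the GPU only while it runs

image = pipe(
    "a photo of a cat wearing a spacesuit",
    num_inference_steps=25,
).images[0]
image.save("out.png")
```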
I was able to run it on a 3060 12GB by changing T5Model to T5EncoderModel with 4-bit quantization (making sure the compute dtype is bfloat16, because the original T5 checkpoints don't really like fp16) and loading the models sequentially: first load the encoder, encode the text, unload and clear the cache; then load the UNet, run inference, unload and clear the cache; then load the MoVQGAN and decode. But I think NF4 is biting me in the ass by changing the embeddings a bit too much for this model, which isn't that good to begin with...
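For reference, here is a rough sketch of that sequential-loading pattern against the diffusers layout of the checkpoint (the repo id, subfolder names, and tokenizer class are assumptions based on the kandinsky-community/kandinsky-3 repo linked above, not the original Kandinsky-3 code):

```python
import gc
import torch
from transformers import AutoTokenizer, T5EncoderModel, BitsAndBytesConfig

repo = "kandinsky-community/kandinsky-3"  # assumed diffusers-style layout

def flush():
    gc.collect()
    torch.cuda.empty_cache()

# 1) Load only the T5 encoder in NF4 with bfloat16 compute, encode, then unload.
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # fp16 compute misbehaves with these T5 weights
)
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer")
encoder = T5EncoderModel.from_pretrained(
    repo, subfolder="text_encoder", quantization_config=quant, device_map="auto"
)
with torch.no_grad():
    ids = tokenizer("a red cat", return_tensors="pt").input_ids.to(encoder.device)
    prompt_embeds = encoder(ids).last_hidden_state
del encoder
flush()

# 2) Load the UNet, denoise with prompt_embeds, then del + flush().
# 3) Load the MoVQGAN, decode the latents, then del + flush().
#    Same load -> run -> unload pattern; the exact UNet/MoVQ classes depend on
#    whether you use the original repo or the diffusers port.
```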
P.S. Loading the T5 encoder takes 30-40s, while encoding itself is done in 2s. The UNet constructor takes 10s and loading its weights takes 7s. MoVQGAN eats almost all of my VRAM decoding at 1024x1024.
@kopyl I can't run the original encoder, so I can't say, but the model is exactly the same. There were similar tests for DeepFloyd's IF (ImagenFree) and it did change the outputs (4 vs 6 vs 16). From my experience, this model is either undertrained or NF4 ruins its understanding. There might also be a "bug" of using fp16 for the encoder during training, which would also change the embeddings.
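If someone can spare the RAM for the full-precision encoder, one way to gauge how much NF4 actually moves the embeddings is to encode the same prompt with both and compare token-wise cosine similarity (purely illustrative; the repo id and subfolder names are the same assumptions as in the sketch above):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel, BitsAndBytesConfig

repo = "kandinsky-community/kandinsky-3"
tok = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer")
ids = tok("a red cat sitting on a windowsill", return_tensors="pt").input_ids

def embed(**kwargs):
    enc = T5EncoderModel.from_pretrained(repo, subfolder="text_encoder", **kwargs)
    with torch.no_grad():
        out = enc(ids.to(enc.device)).last_hidden_state.float().cpu()
    del enc
    torch.cuda.empty_cache()
    return out

ref = embed(torch_dtype=torch.float32)  # full precision on CPU; needs a lot of RAM
nf4 = embed(
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
# Mean per-token cosine similarity between fp32 and NF4 embeddings.
print(torch.nn.functional.cosine_similarity(ref, nf4, dim=-1).mean())
```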