Can we have the NF4 version without the t5xxl & clip, please, for better speed on low-VRAM (<8GB) GPUs?

#13
by dataandmind - opened

When the t5xxl & clip are loaded separately, the flux dev fp8 version runs at ~12s/iteration on my RTX 4070 8GB laptop. However, with the fp8 version that bundles them, a single iteration takes more than 90s on average.
I expected the same story for the NF4 version, and indeed this NF4 checkpoint takes ~54s per iteration for me. I believe it comes down to how CUDA offloads models into system RAM: with separately loaded weights it can swap each component in and out easily, but when everything is bundled it has to juggle the whole thing at once and execution slows down.

I found this version that is less than 8GB - https://huggingface.co./sayakpaul/flux.1-dev-nf4/tree/main. Going to give it a try.
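For reference, a quick back-of-the-envelope estimate (assuming Flux.1-dev's transformer has roughly 12B parameters and bitsandbytes NF4 stores 4-bit weights plus one fp32 absmax scale per 64-weight block; these numbers are assumptions, not from this thread) of why an NF4 checkpoint without the text encoders fits under 8GB:

```python
# Rough size estimate for an NF4-quantized Flux.1-dev transformer.
# Assumed: ~12e9 parameters, 4 bits per weight, plus one fp32
# quantization constant per 64-weight block (bitsandbytes default-ish layout).
params = 12e9
weight_bytes = params * 4 / 8        # 4-bit packed weights
scale_bytes = (params / 64) * 4      # fp32 absmax per 64-weight block
total_gb = (weight_bytes + scale_bytes) / 1024**3
print(f"~{total_gb:.1f} GB")         # comfortably under 8 GB, leaving room for activations
```

So a transformer-only NF4 file landing in the 6-7GB range is consistent with what that repo shows.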

What checkpoint loader did you use? It is not working with my NF4 loader, either V1 or V2.

I tried it and it is not working in any of the checkpoint loaders. I moved to the GGUF version; even Q8 runs well on my 4070 8GB when the model (unet GGUF), clip & t5xxl are loaded separately.
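A similar estimate (again assuming ~12B transformer parameters, and using GGUF q8_0's layout of 32 int8 weights plus one fp16 scale per block) shows why even Q8 still depends on offloading on an 8GB card, which is exactly where separate loading helps:

```python
# GGUF q8_0 stores blocks of 32 int8 weights plus one fp16 scale,
# i.e. (32*8 + 16) / 32 = 8.5 bits per weight on average.
params = 12e9                          # assumed Flux.1-dev transformer size
bits_per_weight = (32 * 8 + 16) / 32   # = 8.5
total_gb = params * bits_per_weight / 8 / 1024**3
print(f"~{total_gb:.1f} GB")           # exceeds 8 GB of VRAM, so the unet must partially
                                       # offload to system RAM; keeping clip/t5xxl as
                                       # separate files lets them be swapped independently
```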
