For the fastest inference on 12GB VRAM, are the following GGUF models appropriate to use?

#4 opened by ViratX

Could anyone please confirm or explain this:
Does the file size directly correspond to the amount of VRAM each model will take up?
For the fastest possible inference, is the goal to have all of the models loaded into GPU VRAM at once?

1.) flux1-dev-Q4_K_S.gguf - 6.81 GB
2.) t5-v1_1-xxl-encoder-Q5_K_S.gguf - 3.29 GB
3.) clip_l.safetensors - 234 MB

That adds up to about 10.3 GB, leaving roughly 1.5 GB of VRAM as room for inference calculations.
The monitor is connected to the iGPU, and the browser's hardware acceleration has been turned off.
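
As a rough sanity check of that budget, here is a minimal sketch (assuming PyTorch with CUDA available; the file paths are placeholders for wherever the files actually live) that compares the on-disk size of the three files with the card's total VRAM:

```python
# Rough VRAM budget check: sums the on-disk size of the files listed above and
# compares it with total GPU memory. File size only approximates the memory the
# weights occupy once loaded; quantized GGUF weights stay close to their file
# size, but activations, the CUDA context, and upcast buffers add overhead.
import os
import torch

# Placeholder paths - adjust to your own model folders.
files = [
    "models/unet/flux1-dev-Q4_K_S.gguf",
    "models/clip/t5-v1_1-xxl-encoder-Q5_K_S.gguf",
    "models/clip/clip_l.safetensors",
]

weights_bytes = sum(os.path.getsize(f) for f in files)
vram_bytes = torch.cuda.get_device_properties(0).total_memory

print(f"weights on disk : {weights_bytes / 1024**3:.2f} GiB")
print(f"total VRAM      : {vram_bytes / 1024**3:.2f} GiB")
print(f"headroom        : {(vram_bytes - weights_bytes) / 1024**3:.2f} GiB")
```

Keep in mind the loaded footprint is a bit larger than the file sizes, so the real headroom will be somewhat tighter than this prints.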

Same question here. Also, the model card says to use at least Q5_K_M for the T5 encoder; is that correct?

Actually, as long as each part is smaller than your VRAM, you can just load them into the GPU one at a time. Yes, it takes extra time to load/unload, but with an SSD it's not too much.
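
If you're scripting this rather than using a UI, diffusers can do exactly that kind of one-model-at-a-time loading via CPU offload, and it can read the GGUF transformer directly. A minimal sketch, assuming diffusers >= 0.32 with the `gguf` package installed and the city96 FLUX.1-dev GGUF repo as the source of the quantized file (not necessarily the exact setup discussed above):

```python
# Load the Q4_K_S GGUF transformer and let diffusers keep only one sub-model
# (text encoders, transformer, VAE) on the GPU at a time.
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Assumed source for the quantized transformer weights.
ckpt_url = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q4_K_S.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_url,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
# Moves each sub-model to the GPU only while it is needed, then back to RAM.
pipe.enable_model_cpu_offload()

image = pipe(
    "a photo of a forest at dawn",
    height=768,
    width=768,
    num_inference_steps=20,
    guidance_scale=3.5,
).images[0]
image.save("flux-q4ks.png")
```

If even a single sub-model doesn't fit, `enable_sequential_cpu_offload()` is the more aggressive (and slower) layer-by-layer variant.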

Yeah, I'm shocked to be generating 768p images in ~70 s using Flux dev on a 4 GB graphics card.
