How can I use multiple GPUs?

#35
by nnnian

Right now I'm running on only one GPU. I've tried using device_map, but it didn't work. How do I get all of my GPUs doing inference at the same time?
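
For reference, this is roughly what I tried (just a sketch; as far as I know "balanced" is the only device_map strategy diffusers accepts at the pipeline level, and it places whole components on different GPUs rather than splitting a single model):

import torch
from diffusers import FluxPipeline

# Ask diffusers to spread the pipeline components (text encoders, transformer, VAE)
# across all visible GPUs. Each individual component still sits on a single GPU.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)

image = pipe("a cat holding a sign", num_inference_steps=28).images[0]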

I manually mapped out all of the tensors and finally got it to work... then it told me it was expecting all of the tensors to be on one GPU -_- six hours wasted. The best option I've found is this line, "pipeline.enable_model_cpu_offload()", which offloads some of the extra memory that inference uses to the CPU. The other option is the code below for 8-bit quantization. It drops VRAM usage to about 16 GB (you don't lose quality). [NOTE: it will slowly use up ~50 GB of system RAM while loading and quantizing, then send the model to the GPU, using just 16 GB of VRAM]:

import torch
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import FlowMatchEulerDiscreteScheduler, AutoencoderKL
from diffusers.models.transformers.transformer_flux import FluxTransformer2DModel
from diffusers.pipelines.flux.pipeline_flux import FluxPipeline
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast

dtype = torch.bfloat16
bfl_repo = "black-forest-labs/FLUX.1-dev"

scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(bfl_repo, subfolder="scheduler")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
tokenizer_2 = T5TokenizerFast.from_pretrained(bfl_repo, subfolder="tokenizer_2", torch_dtype=dtype)
vae = AutoencoderKL.from_pretrained(bfl_repo, subfolder="vae", torch_dtype=dtype)
transformer = FluxTransformer2DModel.from_pretrained(bfl_repo, subfolder="transformer", torch_dtype=dtype)

# Quantize the two largest components (the transformer and the T5 encoder) to 8-bit float
quantize(transformer, weights=qfloat8)
freeze(transformer)

quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

pipeline = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=text_encoder_2,
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=transformer,
)
pipeline.enable_model_cpu_offload()
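
For completeness, generating an image with that pipeline looks roughly like this (the prompt and settings are just placeholders):

prompt = "A cat holding a sign that says hello world"
image = pipeline(
    prompt,
    guidance_scale=3.5,
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux-dev.png")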

When I run with "pipe.enable_sequential_cpu_offload()" it works, but it's too slow. I have 8 GPUs; how can I use all of them?
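
If the goal is throughput rather than fitting a single image across cards, one option (a sketch I haven't verified on this model; it assumes each GPU has enough VRAM for a full copy, or you combine it with the quantization above) is plain data parallelism with accelerate: one pipeline copy per GPU, each process handling different prompts, launched with "accelerate launch --num_processes=8 script.py":

import torch
from accelerate import PartialState
from diffusers import FluxPipeline

# One complete pipeline per process; accelerate gives each process its own GPU.
distributed_state = PartialState()
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.to(distributed_state.device)

prompts = ["a dog", "a cat", "a fox", "an owl"]

# Each process receives a different slice of the prompt list.
with distributed_state.split_between_processes(prompts) as my_prompts:
    for p in my_prompts:
        image = pipe(p, num_inference_steps=28).images[0]
        image.save(f"rank{distributed_state.process_index}_{p.replace(' ', '_')}.png")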

Also, I'm not sure whether you would need to use

text_encoder_2 = None,

and

transformer = None 

when defining the pipeline and then later on:

pipeline.text_encoder_2 = text_encoder_2
pipeline.transformer = transformer

Otherwise you might not actually be using your quantized models?! But I'm not sure about this.

pipeline = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=None,
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=None,
)

pipeline.text_encoder_2 = text_encoder_2
pipeline.transformer = transformer

pipeline.enable_model_cpu_offload()
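
A quick sanity check (just a sketch) to confirm the pipeline is really holding the quantized modules:

# Both attributes should point at the exact objects that were quantized above.
assert pipeline.transformer is transformer
assert pipeline.text_encoder_2 is text_encoder_2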

I have 5 GPUs and it keeps trying to load onto GPU 0 only. Has anyone else figured this out?

Any success? I have two GPUs and had no luck with this problem; I wasted a whole day trying to resolve it.
