CUDA out of memory

#37

by TheAlphaGhost - opened Aug 7, 2024

Aug 7, 2024

I have three GeForce GTX 1080 Ti cards, and I got:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU 0 has a total capacity of 10.90 GiB of which 87.38 MiB is free. Including non-PyTorch memory, this process has 10.81 GiB memory in use. Of the allocated memory 9.70 GiB is allocated by PyTorch, and 982.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
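To make the numbers in this message easier to compare, here is a minimal sketch that pulls the sizes out of the error text and converts them to MiB. `summarize_oom` is a hypothetical helper of mine, not part of PyTorch, and the regular expressions assume the exact wording quoted above, which may differ between PyTorch versions.

```python
import re

# Hypothetical helper: not a PyTorch API, just a parser for the text above.
_UNIT_IN_MIB = {"MiB": 1.0, "GiB": 1024.0}

_PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"total capacity of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch but unallocated",
}


def summarize_oom(message):
    """Return the sizes found in an OOM message, all converted to MiB."""
    summary = {}
    for name, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            summary[name] = float(value) * _UNIT_IN_MIB[unit]
    return summary


message = (
    "CUDA out of memory. Tried to allocate 108.00 MiB. "
    "GPU 0 has a total capacity of 10.90 GiB of which 87.38 MiB is free. "
    "Including non-PyTorch memory, this process has 10.81 GiB memory in use. "
    "Of the allocated memory 9.70 GiB is allocated by PyTorch, and "
    "982.21 MiB is reserved by PyTorch but unallocated."
)

summary = summarize_oom(message)
for name, mib in summary.items():
    print(f"{name:>22}: {mib:9.2f} MiB")
```

Read this way, the 108 MiB request is larger than the 87.38 MiB that is free but smaller than the 982.21 MiB that PyTorch has reserved without using, which is the case the expandable_segments hint is aimed at. With 9.70 GiB already allocated on a 10.90 GiB card, the allocator setting alone may still not be enough.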

Appstane

Aug 7, 2024

Did you find any way to assign multiple GPUs?

TheAlphaGhost

Aug 7, 2024

Not yet, I must first resolve the out-of-memory issue...

Appstane

Aug 8, 2024

@TheAlphaGhost You can solve the memory issue by quantizing the transformer and the T5 text encoder to 8-bit floats with optimum-quanto:

import torch
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import FlowMatchEulerDiscreteScheduler, AutoencoderKL
from diffusers.models.transformers.transformer_flux import FluxTransformer2DModel
from diffusers.pipelines.flux.pipeline_flux import FluxPipeline
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast

dtype = torch.bfloat16
bfl_repo = "black-forest-labs/FLUX.1-dev"

# Load each component separately so the large ones can be quantized before use
scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(bfl_repo, subfolder="scheduler")
# Load CLIP in the same dtype as the other components
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=dtype)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
# Tokenizers hold no weights, so they take no torch_dtype
tokenizer_2 = T5TokenizerFast.from_pretrained(bfl_repo, subfolder="tokenizer_2")
vae = AutoencoderKL.from_pretrained(bfl_repo, subfolder="vae", torch_dtype=dtype)
transformer = FluxTransformer2DModel.from_pretrained(bfl_repo, subfolder="transformer", torch_dtype=dtype)

# Quantize the transformer weights to 8-bit floats, then freeze them
quantize(transformer, weights=qfloat8)
freeze(transformer)

# Do the same for the T5 text encoder
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

pipeline = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=text_encoder_2,
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=transformer,
)
# Keep components on the CPU and move each one to the GPU only while it runs
pipeline.enable_model_cpu_offload()
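For a rough sense of why quantizing helps, here is a back-of-the-envelope sketch of weight memory. The parameter counts are approximate assumptions on my part (about 12 billion for the FLUX.1-dev transformer, as stated on its model card, and roughly 4.7 billion for the T5-XXL encoder). Activations, the VAE, CLIP, and the quantization scales are ignored.

```python
GIB = 1024 ** 3

# Approximate parameter counts (assumptions, not measured values).
APPROX_PARAMS = {
    "transformer": 12.0e9,
    "text_encoder_2": 4.7e9,
}

# Bytes used to store one weight in each format.
BYTES_PER_PARAM = {"bfloat16": 2, "qfloat8": 1}


def weight_gib(num_params, dtype_name):
    """Estimate weight storage in GiB for a given parameter count and format."""
    return num_params * BYTES_PER_PARAM[dtype_name] / GIB


for component, num_params in APPROX_PARAMS.items():
    for dtype_name in BYTES_PER_PARAM:
        size = weight_gib(num_params, dtype_name)
        print(f"{component:>15} in {dtype_name:>8}: about {size:5.1f} GiB")
```

By this estimate the 8-bit transformer alone is still about 11 GiB, so on an 11 GB card it may remain tight even with enable_model_cpu_offload().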

ABDALLALSWAITI

Aug 9, 2024

edited Aug 9, 2024

try this merged quantized model version of FLUX https://civitai.com/models/629858

TheAlphaGhost

Aug 9, 2024

@ABDALLALSWAITI Thanks, but how can I load the model?

ABDALLALSWAITI

Aug 9, 2024

Use ComfyUI or StableSwarmUI. I haven't checked whether the diffusers library can load Flux from a single file.

TheAlphaGhost

Aug 9, 2024

@ABDALLALSWAITI Thanks, will try now…

ernestyalumni

Aug 10, 2024

Remember to call pipeline.enable_sequential_cpu_offload() as well. Example: https://github.com/InServiceOfX/InServiceOfX/blob/master/PythonLibraries/HuggingFace/MoreDiffusers/morediffusers/Applications/terminal_only_finite_loop_flux.py. I configure both enable_model_cpu_offload() and enable_sequential_cpu_offload() here: https://github.com/InServiceOfX/InServiceOfX/blob/master/Configurations/HuggingFace/MoreDiffusers/flux_pipeline_configuration.yml.example

RogerRose

5 days ago

Running FLUX.1-dev Image Generation with Memory Optimization on my Nvidia GTX 1070 8GB GPU

This guide explains how to run the FLUX.1-dev image generation model with various memory optimizations to handle GPU memory constraints.

Setup and Imports

import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
import torch
from diffusers import FluxPipeline

The first lines set up our environment:

Setting PYTORCH_CUDA_ALLOC_CONF before torch is imported helps reduce memory fragmentation
We import PyTorch and the FluxPipeline from the diffusers library
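If you sometimes set the allocator option from the shell, a small variation avoids overwriting it. This is only a sketch: `configure_allocator` is a hypothetical helper, and it assumes the variable is read when PyTorch's CUDA allocator initializes, which is why it runs before torch is imported.

```python
import os


def configure_allocator(setting="expandable_segments:True"):
    """Set PYTORCH_CUDA_ALLOC_CONF only if the user has not already chosen a value."""
    return os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", setting)


active = configure_allocator()
print(f"PYTORCH_CUDA_ALLOC_CONF={active}")

# Import torch only after this point, e.g.:
# import torch
```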

Pipeline Configuration

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    use_safetensors=True
)

Here we configure the pipeline with two loading options:

torch_dtype=torch.bfloat16 loads the weights in 16-bit precision, halving memory use compared with float32
use_safetensors=True loads the weights from safetensors files

Memory Optimizations

torch.cuda.empty_cache()
pipe.enable_attention_slicing()
pipe.enable_sequential_cpu_offload()

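These lines apply three memory-saving techniques:

empty_cache() releases cached CUDA memory that PyTorch has reserved but is not using
enable_attention_slicing() processes attention in smaller chunks
enable_sequential_cpu_offload() keeps the weights in CPU RAM and moves each submodule to the GPU only while it runs, which is slow but keeps peak VRAM low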

Image Generation

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt,
    height=160,
    width=160,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]

The generation parameters are configured for memory efficiency:

Small image dimensions (160x160) to minimize memory usage
guidance_scale=3.5 controls how closely the image follows the prompt
num_inference_steps=50 sets the number of denoising steps; more steps take longer and usually give cleaner results
max_sequence_length=512 caps the prompt length at 512 tokens
Passing a generator with a fixed seed makes results reproducible
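To see how the image size feeds into memory, here is a small arithmetic sketch of the transformer's sequence length. It assumes, based on my reading of the diffusers Flux pipeline, that the VAE downsamples by 8 and the latents are packed into 2x2 patches, so one image token covers 16x16 pixels, and that the prompt tokens are concatenated with the image tokens. `flux_sequence_length` is a hypothetical helper.

```python
def flux_sequence_length(height, width, max_sequence_length=512,
                         vae_scale=8, patch=2):
    """Return (image_tokens, total_tokens) under the assumptions above."""
    cell = vae_scale * patch  # pixels covered by one image token per side
    image_tokens = (height // cell) * (width // cell)
    return image_tokens, image_tokens + max_sequence_length


for size in (160, 512, 1024):
    image_tokens, total = flux_sequence_length(size, size)
    print(f"{size}x{size}: {image_tokens} image tokens, {total} total")
```

At 160x160 the 512 prompt tokens make up most of the sequence, while at 1024x1024 the image tokens dominate, which is where activation memory grows quickly.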

Saving the Result

image.save("flux-dev.png")

Finally, we save the generated image to a PNG file.

Memory Usage Tips

If you're still experiencing memory issues, you can try:

Further reducing image dimensions
Decreasing the number of inference steps (try 30-40), which mainly shortens the run rather than lowering peak memory
Lowering the max_sequence_length if using shorter prompts
Note that guidance_scale affects prompt adherence, not memory use, so lowering it should not help here
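To check which of these changes actually lowers memory on your card, one option is to measure the peak instead of guessing. A minimal sketch, assuming torch may or may not be installed and a CUDA device may or may not be present (`peak_vram_gib` is a hypothetical helper):

```python
def peak_vram_gib():
    """Return peak GPU memory allocated by PyTorch in GiB, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.max_memory_allocated() / 1024 ** 3


# Before a run you can call torch.cuda.reset_peak_memory_stats()
# so that the reading covers only that run.
peak = peak_vram_gib()
print("no CUDA device found" if peak is None else f"peak allocated: {peak:.2f} GiB")
```

Calling it right after the pipe(...) call gives one number per setting to compare.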

Complete Code

Here's the complete code block for easy copying:

import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    use_safetensors=True
)

# Memory optimizations
torch.cuda.empty_cache()
pipe.enable_attention_slicing()
pipe.enable_sequential_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt,
    height=160,
    width=160,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]

image.save("flux-dev.png")
