Huge memory consumption with SD3.5-medium
According to the chart here, SD3.5-medium should run fine on 10 GB of VRAM:
https://stability.ai/news/introducing-stable-diffusion-3-5
However, my test program fails on a g4dn.xlarge AWS instance (4 vCPUs / 16 GB RAM + 48 GB swap, with a Tesla T4 GPU that has 16 GB VRAM). It runs out of memory because CUDA cannot allocate any more: nvidia-smi shows ~15 GB already in use, and it cannot complete even one picture.
I'm wondering what's wrong here?
Attached the full source code.
import os
import json
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("./stable-diffusion-3.5-medium/")
if torch.cuda.is_available():
    print('use cuda')
    pipe = pipe.to("cuda")
elif torch.backends.mps.is_available():
    print('use mps')
    pipe = pipe.to('mps')
else:
    print('use cpu')

with open('data.json', 'r') as f:
    data = json.load(f)

os.makedirs('output', exist_ok=True)
for row in data:
    prompt = '%s, style is %s, light is %s' % (row['prompt'], row['style'], row['light'])
    filename = 'output/%s.png' % (row['uuid'])
    # Default to a square 1280x1280 image, then adjust for the requested aspect ratio.
    height = 1280
    width = 1280
    if row['aspect_ratio'] == '16:9':
        height = 720   # landscape: 1280x720
    elif row['aspect_ratio'] == '9:16':
        width = 720    # portrait: 720x1280
    print('saving', filename)
    image = pipe(prompt, height=height, width=width).images[0]
    image.save(filename)
Did it resolve for you, @yue32000?
@oddball516
The cause is the T5 text encoder; you can resolve it with
pipe.enable_model_cpu_offload()
@YaTharThShaRma999 Do you know how enable_model_cpu_offload() works? Are you saying the T5 model will be offloaded to non-GPU memory?
@oddball516 Yeah, kind of. When it's needed, it is moved to the GPU for faster computation; after it's done computing (1-2 s), it is moved back to the CPU.
It's very big, in fact bigger than the actual image-generation model (4B vs 2B parameters), but it's only used once per image and is fast.
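In other words, you let diffusers manage device placement instead of moving the whole pipeline to CUDA yourself. A minimal sketch of that pattern (the model id and prompt here are just placeholders for illustration):

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.float16,
)
# Note: no pipe.to("cuda") here. With model offload enabled, each sub-model
# (T5 encoder, CLIP encoders, transformer, VAE) is copied to the GPU only
# while it runs and is then returned to CPU RAM.
pipe.enable_model_cpu_offload()

image = pipe("a photo of a cat", height=1024, width=1024).images[0]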
With this code,
import os
import json
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    ignore_mismatched_sizes=True,
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe = pipe.to('cuda')
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_model_cpu_offload()
it still OOMs on a T4:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 14.57 GiB of which 154.75 MiB is free. Including non-PyTorch memory, this process has 14.41 GiB memory in use. Of the allocated memory 14.05 GiB is allocated by PyTorch, and 268.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
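For what it's worth, pipe.to('cuda') loads the entire fp16 pipeline onto the GPU before offloading is enabled, which by itself approaches the T4's 16 GB, so try removing that line first. If it still OOMs, here is a hedged sketch of two further options (dropping T5 may hurt quality on long prompts, and sequential offload is slower):

import torch
from diffusers import StableDiffusion3Pipeline

# Option 1: drop the 4B-parameter T5 encoder entirely; prompts are then
# handled by the two CLIP encoders only.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16,
)

# Option 2: offload at the submodule level instead of the model level;
# slower, but more memory-frugal than enable_model_cpu_offload().
pipe.enable_sequential_cpu_offload()

image = pipe("a photo of a cat", height=1024, width=1024).images[0]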