Huge memory consumption with SD3.5-medium
According to the chart here, SD3.5-medium should run fine on 10 GB of VRAM:
https://stability.ai/news/introducing-stable-diffusion-3-5
However, my test program fails on a g4dn.xlarge AWS instance (4 vCPUs / 16 GB RAM + 48 GB swap, with a Tesla T4 GPU that has 16 GB VRAM). It runs out of memory because CUDA cannot allocate any more: nvidia-smi shows ~15 GB already in use, and it cannot complete even one picture.
I'm wondering what's wrong here?
Attached the full source code.
import os
import json
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("./stable-diffusion-3.5-medium/")
if torch.cuda.is_available():
    print('use cuda')
    pipe = pipe.to("cuda")
elif torch.backends.mps.is_available():
    print('use mps')
    pipe = pipe.to('mps')
else:
    print('use cpu')

with open('data.json', 'r') as f:
    data = json.load(f)

os.makedirs('output', exist_ok=True)
for row in data:
    prompt = '%s, style is %s, light is %s' % (row['prompt'], row['style'], row['light'])
    filename = 'output/%s.png' % (row['uuid'])
    # Default to a square 1280x1280 image, then adjust for the requested aspect ratio.
    height = 1280
    width = 1280
    if row['aspect_ratio'] == '16:9':
        height = 720   # landscape: 1280x720
    elif row['aspect_ratio'] == '9:16':
        width = 720    # portrait: 720x1280
    print('saving', filename)
    image = pipe(prompt, height=height, width=width).images[0]
    image.save(filename)
Did it resolve for you, @yue32000?
@oddball516
The cause is the T5 text encoder; you can resolve it with
pipe.enable_model_cpu_offload()
@YaTharThShaRma999 Do you know how enable_model_cpu_offload() works? Are you saying the T5 model will be offloaded to non-GPU memory?
@oddball516 Yeah, kind of. When it's needed, it is moved to the GPU for faster computation; after it's done computing (1-2 s), it is moved back to the CPU.
It's very big, in fact bigger than the actual image-generation model (4B vs 2B parameters), but it's only used once per image and is fast.
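In other words, you let diffusers manage device placement instead of moving the whole pipeline to CUDA yourself. A minimal sketch of that pattern (the model id and prompt here are just placeholders for illustration):

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.float16,
)
# Note: no pipe.to("cuda") here. With model offload enabled, each sub-model
# (T5 encoder, CLIP encoders, transformer, VAE) is copied to the GPU only
# while it runs and is then returned to CPU RAM.
pipe.enable_model_cpu_offload()

image = pipe("a photo of a cat", height=1024, width=1024).images[0]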
With this code,
import os
import json
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    ignore_mismatched_sizes=True,
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe = pipe.to('cuda')
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_model_cpu_offload()
it still OOMs on a T4:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 14.57 GiB of which 154.75 MiB is free. Including non-PyTorch memory, this process has 14.41 GiB memory in use. Of the allocated memory 14.05 GiB is allocated by PyTorch, and 268.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
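For what it's worth, pipe.to('cuda') loads the entire fp16 pipeline onto the GPU before offloading is enabled, which by itself approaches the T4's 16 GB, so try removing that line first. If it still OOMs, here is a hedged sketch of two further options (dropping T5 may hurt quality on long prompts, and sequential offload is slower):

import torch
from diffusers import StableDiffusion3Pipeline

# Option 1: drop the 4B-parameter T5 encoder entirely; prompts are then
# handled by the two CLIP encoders only.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16,
)

# Option 2: offload at the submodule level instead of the model level;
# slower, but more memory-frugal than enable_model_cpu_offload().
pipe.enable_sequential_cpu_offload()

image = pipe("a photo of a cat", height=1024, width=1024).images[0]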