Which options to clone the space?

#31
by NoidoDev - opened

What's the cheapest GPU option for cloning the space? I tried a few, but they didn't have enough VRAM. At the same time, some people here claim it runs on 24GB. Am I understanding correctly that it needs more than 50GB?

It runs fine on 24GB. I'm not sure how it's loaded when you clone the space, but running it locally on 24GB works fine.

Okay, I'll try again and post the errors here if it doesn't work.

===== Application Startup at 2024-08-06 22:05:51 =====

/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling transformers.utils.move_cache().

0it [00:00, ?it/s]
0it [00:00, ?it/s]

Loading pipeline components...: 0%| | 0/7 [00:00<?, ?it/s]You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers

Loading pipeline components...: 86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 6/7 [00:01<00:00, 5.20it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 6.35it/s]
Loading pipeline components...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:01<00:00, 4.59it/s]
Traceback (most recent call last):
File "/home/user/app/app.py", line 11, in
pipe = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=dtype).to(device)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 431, in to
module.to(device, dtype)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1174, in to
return self._apply(convert)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 805, in _apply
param_applied = fn(param)
File "/home/user/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in convert
return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB. GPU 0 has a total capacity of 21.96 GiB of which 47.06 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.72 GiB is allocated by PyTorch, and 9.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

During inference it uses a bit more VRAM than the initial model load. Try adding pipeline.enable_model_cpu_offload() right after you initialize your pipeline; any components not actively in use get offloaded to CPU memory. You can also try setting device_map="balanced", though in my experience that didn't help.
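Applied to the space's app.py from the traceback above, that would look roughly like this (a minimal sketch; note that enable_model_cpu_offload() takes the place of the explicit .to(device) call, since offloading manages device placement itself):

import torch
from diffusers import DiffusionPipeline

dtype = torch.bfloat16

# Load without .to(device); offloading moves modules to the GPU only while they run.
pipe = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=dtype)
pipe.enable_model_cpu_offload()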

You might also want to update your transformers library from 4.22.0 to the newer 4.44.0, since this is a more recent model.
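Assuming the space installs dependencies through requirements.txt or pip, something along these lines should pull in FLUX-capable versions (the pins are illustrative, not exact minimums):

pip install -U "transformers>=4.44.0" "diffusers>=0.30.0" accelerate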

Here is the code to quantize the model to fp8. It'll run on 16GB of VRAM without noticeable quality loss (it will slowly use up ~50GB of system RAM while loading and quantizing, then move everything to the GPU, using only about 16GB of VRAM):

import torch
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import FlowMatchEulerDiscreteScheduler, AutoencoderKL
from diffusers.models.transformers.transformer_flux import FluxTransformer2DModel
from diffusers.pipelines.flux.pipeline_flux import FluxPipeline
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast

dtype = torch.bfloat16

bfl_repo = "black-forest-labs/FLUX.1-dev"

scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(bfl_repo, subfolder="scheduler")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
tokenizer_2 = T5TokenizerFast.from_pretrained(bfl_repo, subfolder="tokenizer_2", torch_dtype=dtype)
vae = AutoencoderKL.from_pretrained(bfl_repo, subfolder="vae", torch_dtype=dtype)
transformer = FluxTransformer2DModel.from_pretrained(bfl_repo, subfolder="transformer", torch_dtype=dtype)

# Quantize the FLUX transformer and the T5 text encoder to 8-bit float weights,
# then freeze them so the quantized weights are what gets used at inference.
quantize(transformer, weights=qfloat8)
freeze(transformer)

quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

pipeline = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=text_encoder_2,
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=transformer,
)
# Offload idle components to the CPU so peak VRAM stays around 16GB.
pipeline.enable_model_cpu_offload()
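For reference, a quick inference sketch on top of that pipeline (prompt, step count, and seed are placeholders; FLUX.1-dev uses a real guidance scale, unlike schnell):

prompt = "a misty forest at dawn"  # placeholder prompt
image = pipeline(
    prompt,
    guidance_scale=3.5,           # typical for FLUX.1-dev
    num_inference_steps=28,       # placeholder; trade speed for quality
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux-fp8.png")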

Not sure if I can use that on HF, but I'll look into it. Right now my free GPU availability here has gotten much better.

It didn't work, so I'm wrapping this up. Maybe I'll clone a space if someone makes one, but right now it isn't even necessary. There's enough GPU available.

NoidoDev changed discussion status to closed
