🚩 Report: Not working

#91
by alystear - opened

I'm getting a "file not found" error for pytorch.bin. HF tells me I ran out of CUDA memory, but I tried both the 16 GB and 30 GB Nvidia T4 instances and got the same error. The logs show it's a file-not-found error. Testing with an A10G (someone in the discussions claimed it worked for them when using one) and will report back.

I get the following error using an A10G small.

Enabling memory efficient attention with xformers...
Could not enable memory efficient attention. Make sure xformers is installed correctly and a GPU is available: No operator found for memory_efficient_attention_forward with inputs:
     query       : shape=(1, 2, 1, 40) (torch.float32)
     key         : shape=(1, 2, 1, 40) (torch.float32)
     value       : shape=(1, 2, 1, 40) (torch.float32)
     attn_bias   : <class 'NoneType'>
     p           : 0.0
flshattF is not supported because:
    xFormers wasn't build with CUDA support
    dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
tritonflashattF is not supported because:
    xFormers wasn't build with CUDA support
    dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
    requires A100 GPU
cutlassF is not supported because:
    xFormers wasn't build with CUDA support
smallkF is not supported because:
    xFormers wasn't build with CUDA support
    max(query.shape[-1] != value.shape[-1]) > 32
    unsupported embed per head: 40
/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to from_config.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
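
For what it's worth, the xFormers failure above looks non-fatal: the Space logs the warning and falls back to default attention. A minimal sketch of that fallback pattern, assuming `pipe` is a diffusers `StableDiffusionPipeline` (the `enable_xformers_memory_efficient_attention` method is real diffusers API; the helper name is my own):

```python
import logging

def enable_xformers_if_available(pipe) -> bool:
    """Try to switch `pipe` to xFormers memory-efficient attention.

    Sketch of the pattern the log above suggests: if xFormers was built
    without CUDA support (or the inputs are float32, which the flash
    kernels reject), the call raises and we keep default attention.
    """
    try:
        pipe.enable_xformers_memory_efficient_attention()
        return True
    except Exception as exc:
        logging.warning(
            "Could not enable memory efficient attention: %s", exc)
        return False
```

Since every kernel in the log rejects `torch.float32`, loading the pipeline with `torch_dtype=torch.float16` would presumably also sidestep the dtype complaints, though I haven't verified that on this Space.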

Nothing happened for 10 minutes, so I ended the Space.

I also get an error when training with a T4 small. However, I have no error message, as the log had already been cleared by the time I went to check it. This has happened three times in a row now.
I will now test with a T4 medium.

Ok, I was able to get a stack trace this time (on a T4 medium):

/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to from_config.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
thierry thierry thierry Adding Safety Checker to the model...
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/gradio/routes.py", line 337, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/gradio/blocks.py", line 1015, in process_api
    result = await self.call_function(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/gradio/blocks.py", line 833, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/gradio/helpers.py", line 584, in tracked_fn
    response = fn(*args)
  File "/home/user/app/app.py", line 344, in train
    push(model_name, where_to_upload, hf_token, which_model, True)
  File "/home/user/app/app.py", line 364, in push
    convert("output_model", "model.ckpt")
  File "/home/user/app/convertosd.py", line 270, in convert
    unet_state_dict = torch.load(unet_path, map_location="cpu")
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'output_model/unet/diffusion_pytorch_model.bin'
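
So the crash happens because `convertosd.py` calls `torch.load` on a UNet weight file that training apparently never wrote (likely because the run died earlier, e.g. from CUDA OOM). A hypothetical pre-flight check (the helper name and the `.safetensors` fallback are my assumptions, not part of the Space's code) would at least turn this into a readable error:

```python
from pathlib import Path

def find_unet_weights(output_dir: str) -> Path:
    """Locate the UNet weight file inside a diffusers output folder.

    Hypothetical guard: check for the expected weight file before the
    convert step calls torch.load, so a failed or partial training run
    produces a clear message instead of a FileNotFoundError deep inside
    torch.serialization.
    """
    unet_dir = Path(output_dir) / "unet"
    for name in ("diffusion_pytorch_model.bin",
                 "diffusion_pytorch_model.safetensors"):
        candidate = unet_dir / name
        if candidate.exists():
            return candidate
    raise RuntimeError(
        f"No UNet weights found in {unet_dir} - training probably "
        "did not finish (e.g. it ran out of CUDA memory)."
    )
```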

Ok, my problem may be this: https://huggingface.co./spaces/multimodalart/dreambooth-training/discussions/88

Yes, I was referencing this post in my second comment. Please do let me know if the proposed solution of using an A10G works for you, as I couldn't get it working.

Actually, I think it may work with an A10G. I saw that, just like yours, the container log seemed to be stuck for ages, but then it eventually continued! Unfortunately, my Space was shut down due to my inactivity settings. I will try again with a longer inactivity timeout.

However, I also noticed a bunch of "CUDA out of memory" errors in my notification inbox, though I am unsure which of my trial runs they belong to.

I'll let you know once I've made my next attempt... should be shortly.

Oh my god, yes, it took ages, but it is indeed working with an A10G! I set my sleep timeout to 10 hours and it worked!

Thank you, I'll close this issue since it worked out for you.

alystear changed discussion status to closed
