torch.cuda.OutOfMemoryError

#26
by shiwanglai - opened

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.65 GiB total capacity; 5.93 GiB already allocated; 122.56 MiB free; 5.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
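For what it's worth, the allocator hint in that message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation. A minimal sketch (the 128 MiB split size here is just an illustrative value, not a recommendation from this thread):

```python
# Minimal sketch: apply the allocator hint from the error message.
# Must be set before the first CUDA allocation; 128 MiB is an illustrative value.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
print(f"{torch.cuda.get_device_properties(0).total_memory / 2**30:.2f} GiB total")
```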

Hi @shiwanglai
Thanks for the issue! Can you share the snippet you're using?

@ybelkada
~/nlp/lm-evaluation-harness$ python lm_eval/main.py --model=hf --model_args pretrained=google/gemma-2b,load_in_4bit=True --tasks wikitext --batch_size 1
is going OOM; not sure what's going on.
The same happens with:
~/nlp/lm-evaluation-harness$ python lm_eval/main.py --model=hf --model_args pretrained=google/gemma-2b --tasks wikitext --batch_size 1
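As a sanity check outside the harness, here is a minimal sketch that loads gemma-2b in 4-bit directly with transformers + bitsandbytes and reports how much GPU memory the weights alone take (the quantization settings are illustrative assumptions, not necessarily what the harness passes):

```python
# Minimal sketch (not the harness code): load gemma-2b in 4-bit and report GPU memory,
# to separate model-loading memory from evaluation-time memory.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"allocated after load: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
```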

Same with gemma-7b:

File "/home/vincent/miniconda3/envs/pt2.1.0/lib/python3.11/site-packages/transformers/models/gemma/modeling_gemma.py", line 1088, in forward
logits = logits.float()
^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.81 GiB. GPU 0 has a total capacity of 23.67 GiB of which 6.05 GiB is free. Including non-PyTorch memory, this process has 15.88 GiB memory in use. Of the allocated memory 13.06 GiB is allocated by PyTorch, and 2.52 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
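That allocation size is consistent with upcasting the full logits tensor to float32 in the line above: assuming Gemma's ~256k vocabulary and the default 8192-token context, a batch of 1 needs 1 × 8192 × 256000 × 4 bytes ≈ 7.81 GiB, which is why lowering max_length (or the batch size) helps. A back-of-the-envelope check:

```python
# Back-of-the-envelope check (assumed shapes): logits of shape [batch, seq_len, vocab]
# upcast to float32 by `logits = logits.float()` in modeling_gemma.py.
batch, seq_len, vocab = 1, 8192, 256_000  # Gemma vocab ~256k; 8192 is the assumed default context
print(batch * seq_len * vocab * 4 / 2**30)  # -> 7.8125 GiB, matching the failed allocation
```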

I reduced the max_length, but there are still issues with gemma-7b (and gemma-2b's perplexity is much higher than phi-2's):

hf (pretrained=google/gemma-7b,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | None | word_perplexity | 42455038.3994 | ± N/A |
| | | none | None | byte_perplexity | 26.6969 | ± N/A |
| | | none | None | bits_per_byte | 4.7386 | ± N/A |

hf (pretrained=google/gemma-7b-it,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | None | word_perplexity | 1795.5652 | ± N/A |
| | | none | None | byte_perplexity | 4.0602 | ± N/A |
| | | none | None | bits_per_byte | 2.0216 | ± N/A |

hf (pretrained=google/gemma-7b,max_length=256), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | None | word_perplexity | 41037962.2523 | ± N/A |
| | | none | None | byte_perplexity | 26.5280 | ± N/A |
| | | none | None | bits_per_byte | 4.7294 | ± N/A |

hf (pretrained=google/gemma-2b,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | None | word_perplexity | 55.9289 | ± N/A |
| | | none | None | byte_perplexity | 2.1223 | ± N/A |
| | | none | None | bits_per_byte | 1.0857 | ± N/A |

hf (pretrained=google/gemma-2b-it,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | None | word_perplexity | 242.5852 | ± N/A |
| | | none | None | byte_perplexity | 2.7924 | ± N/A |
| | | none | None | bits_per_byte | 1.4815 | ± N/A |

Actually, I faced an OOM problem when using the DPO trainer for fine-tuning Gemma-2-2b-it, with a 40 GB GPU and batch_size=2. Interesting.
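For the DPO case, the usual memory levers are a smaller per-device batch with gradient accumulation, gradient checkpointing, bf16, and capped sequence lengths. A minimal sketch using trl's DPOConfig with illustrative values (exact argument names can differ across trl versions, and output_dir is a placeholder):

```python
# Minimal sketch (illustrative values, not a verified fix for this exact setup):
# memory-saving knobs for a trl DPO run on a 40 GB GPU.
from trl import DPOConfig

args = DPOConfig(
    output_dir="dpo-gemma-2-2b-it",     # placeholder path
    per_device_train_batch_size=1,      # halve the per-device batch...
    gradient_accumulation_steps=4,      # ...and keep a reasonable effective batch size
    gradient_checkpointing=True,        # trade compute for activation memory
    bf16=True,                          # keep activations out of fp32
    max_length=1024,                    # cap prompt + completion tokens
    max_prompt_length=512,
)
```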
