torch.cuda.OutOfMemoryError

#26
by shiwanglai - opened

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.65 GiB total capacity; 5.93 GiB already allocated; 122.56 MiB free; 5.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
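For what it's worth, the allocator hint in that message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation. A minimal sketch (the 128 MiB split size here is just an illustrative value, not a recommendation from this thread):

```python
# Minimal sketch: apply the allocator hint from the error message.
# Must be set before the first CUDA allocation; 128 MiB is an illustrative value.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
print(f"{torch.cuda.get_device_properties(0).total_memory / 2**30:.2f} GiB total")
```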

Hi @shiwanglai
Thanks for the issue! Can you share the snippet you're using?

@ybelkada
~/nlp/lm-evaluation-harness$ python lm_eval/main.py --model=hf --model_args pretrained=google/gemma-2b,load_in_4bit=True --tasks wikitext --batch_size 1
is going OOM; not sure what's going on.
The same happens with:
~/nlp/lm-evaluation-harness$ python lm_eval/main.py --model=hf --model_args pretrained=google/gemma-2b --tasks wikitext --batch_size 1
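As a sanity check outside the harness, here is a minimal sketch that loads gemma-2b in 4-bit directly with transformers + bitsandbytes and reports how much GPU memory the weights alone take (the quantization settings are illustrative assumptions, not necessarily what the harness passes):

```python
# Minimal sketch (not the harness code): load gemma-2b in 4-bit and report GPU memory,
# to separate model-loading memory from evaluation-time memory.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"allocated after load: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
```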

Same with gemma-7b:

File "/home/vincent/miniconda3/envs/pt2.1.0/lib/python3.11/site-packages/transformers/models/gemma/modeling_gemma.py", line 1088, in forward
logits = logits.float()
^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.81 GiB. GPU 0 has a total capacity of 23.67 GiB of which 6.05 GiB is free. Including non-PyTorch memory, this process has 15.88 GiB memory in use. Of the allocated memory 13.06 GiB is allocated by PyTorch, and 2.52 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
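That allocation size is consistent with upcasting the full logits tensor to float32 in the line above: assuming Gemma's ~256k vocabulary and the default 8192-token context, a batch of 1 needs 1 × 8192 × 256000 × 4 bytes ≈ 7.81 GiB, which is why lowering max_length (or the batch size) helps. A back-of-the-envelope check:

```python
# Back-of-the-envelope check (assumed shapes): logits of shape [batch, seq_len, vocab]
# upcast to float32 by `logits = logits.float()` in modeling_gemma.py.
batch, seq_len, vocab = 1, 8192, 256_000  # Gemma vocab ~256k; 8192 is the assumed default context
print(batch * seq_len * vocab * 4 / 2**30)  # -> 7.8125 GiB, matching the failed allocation
```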

I reduced the max_length, but there are still issues with gemma-7b (and gemma-2b's perplexity is much higher than phi-2's):

hf (pretrained=google/gemma-7b,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | None | word_perplexity | 42455038.3994 | ± N/A |
| | | none | None | byte_perplexity | 26.6969 | ± N/A |
| | | none | None | bits_per_byte | 4.7386 | ± N/A |

hf (pretrained=google/gemma-7b-it,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | None | word_perplexity | 1795.5652 | ± N/A |
| | | none | None | byte_perplexity | 4.0602 | ± N/A |
| | | none | None | bits_per_byte | 2.0216 | ± N/A |

hf (pretrained=google/gemma-7b,max_length=256), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | None | word_perplexity | 41037962.2523 | ± N/A |
| | | none | None | byte_perplexity | 26.5280 | ± N/A |
| | | none | None | bits_per_byte | 4.7294 | ± N/A |

hf (pretrained=google/gemma-2b,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | None | word_perplexity | 55.9289 | ± N/A |
| | | none | None | byte_perplexity | 2.1223 | ± N/A |
| | | none | None | bits_per_byte | 1.0857 | ± N/A |

hf (pretrained=google/gemma-2b-it,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | None | word_perplexity | 242.5852 | ± N/A |
| | | none | None | byte_perplexity | 2.7924 | ± N/A |
| | | none | None | bits_per_byte | 1.4815 | ± N/A |

Actually, I faced an OOM problem when using the DPO trainer for fine-tuning Gemma-2-2b-it, with a 40 GB GPU and batch_size=2. Interesting.
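For the DPO case, the usual memory levers are a smaller per-device batch with gradient accumulation, gradient checkpointing, bf16, and capped sequence lengths. A minimal sketch using trl's DPOConfig with illustrative values (exact argument names can differ across trl versions, and output_dir is a placeholder):

```python
# Minimal sketch (illustrative values, not a verified fix for this exact setup):
# memory-saving knobs for a trl DPO run on a 40 GB GPU.
from trl import DPOConfig

args = DPOConfig(
    output_dir="dpo-gemma-2-2b-it",     # placeholder path
    per_device_train_batch_size=1,      # halve the per-device batch...
    gradient_accumulation_steps=4,      # ...and keep a reasonable effective batch size
    gradient_checkpointing=True,        # trade compute for activation memory
    bf16=True,                          # keep activations out of fp32
    max_length=1024,                    # cap prompt + completion tokens
    max_prompt_length=512,
)
```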
