Molmo-7B-D-0924 OOM on A100 80GB using Quick Start code
Using the Quick Start code from https://huggingface.co./allenai/Molmo-7B-O-0924 with the same input image, I got an OOM error on an A100 80GB GPU. Can you provide test code that runs on an A100 80GB? Code that runs on 40GB would be even better. Thanks!
Run with
with torch.autocast("cuda", enabled=True, dtype=torch.bfloat16):
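To illustrate what the autocast context does, here is a minimal sketch using a toy linear layer rather than Molmo itself (the layer and tensor sizes are made up for illustration). It falls back to CPU when no GPU is present, which is an addition for portability and not part of the tip above.

```python
import torch

# Pick CUDA when available; the CPU fallback is only so the sketch runs anywhere.
device_type = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for a model layer; its weights are float32 by default.
layer = torch.nn.Linear(16, 16).to(device_type)
x = torch.randn(4, 16, device=device_type)

# Inside the context, eligible ops such as linear layers run in bfloat16,
# while the stored weights keep their original dtype.
with torch.no_grad(), torch.autocast(device_type, enabled=True, dtype=torch.bfloat16):
    y = layer(x)

print("weight dtype:", layer.weight.dtype)
print("output dtype:", y.dtype)
```

Note that autocast changes the compute dtype of activations only. The memory taken by the weights depends on how the model was loaded.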
Thanks for the tip! This enabled me to get this running on a 4090 (24GB VRAM) on Windows. I wanted to share my solution for anyone else who might be running into this issue.
processor = AutoProcessor.from_pretrained(
    'allenai/Molmo-7B-D-0924',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)
model = AutoModelForCausalLM.from_pretrained(
    'allenai/Molmo-7B-D-0924',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)
This loads the full model into VRAM and still leaves plenty for inference.
Prior to calling processor.process, I added:
with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
(The no_grad was a suggestion from o1-preview for memory savings. I'm not sure if it's needed, but it seems to work!)
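Putting the pieces from this thread together, an end-to-end version might look like the sketch below. The function name describe_image and its parameters are hypothetical, and the processor.process / generate_from_batch call shapes are assumed to follow the model card's Quick Start, so check them against the current card before relying on this. It uses torch.autocast("cuda", ...) from the earlier tip in place of torch.cuda.amp.autocast.

```python
def describe_image(image_path, prompt="Describe this image.", max_new_tokens=200):
    """Hypothetical helper: load Molmo-7B-D in bfloat16 and caption one image."""
    # Heavy imports live inside the function so this file can be imported
    # on a machine without the GPU stack installed.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

    repo = "allenai/Molmo-7B-D-0924"
    # Load in bfloat16 instead of the default dtype to halve weight memory.
    processor = AutoProcessor.from_pretrained(
        repo, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = AutoModelForCausalLM.from_pretrained(
        repo, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # no_grad avoids storing activations for backprop; autocast runs
    # eligible ops in bfloat16.
    with torch.no_grad(), torch.autocast("cuda", enabled=True, dtype=torch.bfloat16):
        # Assumed Quick Start call shape: one image plus a text prompt.
        inputs = processor.process(images=[Image.open(image_path)], text=prompt)
        # Move to the model's device and add a batch dimension of 1.
        inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
        output = model.generate_from_batch(
            inputs,
            GenerationConfig(max_new_tokens=max_new_tokens, stop_strings="<|endoftext|>"),
            tokenizer=processor.tokenizer,
        )

    # Keep only the newly generated tokens, dropping the prompt tokens.
    generated = output[0, inputs["input_ids"].size(1):]
    return processor.tokenizer.decode(generated, skip_special_tokens=True)
```

Calling describe_image("photo.jpg") with a local image path (the file name here is a placeholder) would download the weights on first use and return the generated caption as a string.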
@mw44 I forgot to mention the bfloat16 weight loading, thanks for your comment :)
no_grad is always nice to have; it has saved me a ton of VRAM with other transformers models. (In this case, generate_from_batch already has no_grad implemented, so you can leave it out, but it's good practice.)
Any idea what amount of VRAM would be required for the 72B model if using bf16?
@mw44 I don't know the exact numbers, but given that Llama 70B takes 148GB, you could test Molmo 72B on a 2x A100 cloud node.
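For a rough lower bound, the weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes nominal counts of 7 billion and 72 billion parameters taken from the model names, which may not match the exact counts. It covers the weights only, not activations, the KV cache, or the image tokens.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes). Weights only."""
    return num_params * bytes_per_param / 1e9

# Bytes used per parameter for each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Nominal sizes read off the model names (an assumption, not exact counts).
NOMINAL_PARAMS = {"Molmo-7B": 7_000_000_000, "Molmo-72B": 72_000_000_000}

for model_name, num_params in NOMINAL_PARAMS.items():
    for dtype_name, nbytes in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(num_params, nbytes)
        print(f"{model_name} in {dtype_name}: about {gb:.0f} GB of weights")
```

Under these assumptions the 72B weights alone come to about 144 GB in bf16, so two 80GB cards would leave little headroom for inference.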