Help: CUDA Out of Memory. Hardware requirements.

#147
by zebfreeman

I am trying to load Mixtral 8x7B on my local machine to run inference. I am using vLLM to serve the model:
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --load-format safetensors --enforce-eager --worker-use-ray --gpu-memory-utilization .95
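For reference, that command starts vLLM's OpenAI-compatible server, which listens on port 8000 by default. Once it actually comes up, a minimal test request looks roughly like this (the prompt is just a placeholder, and the default port is assumed):

import requests

# Query the OpenAI-compatible chat endpoint exposed by vLLM's api_server.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])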

I have also tried FastChat:
python3 -m fastchat.serve.model_worker --model-path zeb-7b-v1.4 --model-name zeb --num-gpus 2 --cpu-offloading
as well as trying --load-8bit

None of these methods worked. vLLM kills the terminal just as the model finishes downloading its weights, and FastChat produces this error while loading the last few checkpoint shards:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 1 has a total capacty of 47.99 GiB of which 32.88 GiB is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 13.22 GiB is allocated by PyTorch, and 17.69 MiB is reserved by PyTorch but unallocated.
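For reference, free and total memory per GPU can be checked with standard PyTorch calls before loading anything (a minimal sketch; run it in the same environment as vLLM/FastChat):

import torch

# Report free/total VRAM for each visible GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # both values in bytes
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i} ({name}): {free / 1024**3:.1f} GiB free / {total / 1024**3:.1f} GiB total")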

My Desktop consists of:
GPU: 2x RTX 6000 Ada, 96 GB VRAM total (48 GB each)
Memory: 128GB RAM
1TB NVMe SSD
Intel i7

Answers in other posts are confusing. Is this not enough VRAM or RAM? What do I need to upgrade to be able to run Mixtral? I don't want to use a quantized model. What are the minimum VRAM and RAM requirements to download and run the model for my RAG application?

Hi there,
I don't think you will be able to run Mixtral unquantized with your current setup. The model weights alone are about 95 GB, on top of which you need room for the CUDA graphs (or maybe not, since you pass --enforce-eager) and, above all, for the KV cache of the 32k-token context length. In total, plan on roughly 120-140 GB of VRAM for the unquantized version of Mixtral, so something like 2x NVIDIA A100 80 GB (see the rough arithmetic after the command below). If you want to try a quantized version, take a look at one of the AWQ quantizations that work with vLLM:
https://huggingface.co./TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ

and run it as
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ --quantization awq --tensor-parallel-size 2 --host 0.0.0.0
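For reference, here is the back-of-envelope arithmetic behind those VRAM numbers (46.7B is the commonly cited total parameter count for Mixtral 8x7B; the overhead term below is a rough assumption, not a measurement):

# Rough VRAM estimate for unquantized (16-bit) Mixtral 8x7B.
params = 46.7e9                # total parameters (all 8 experts + shared layers)
bytes_per_param = 2            # fp16 / bf16
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")   # ~93 GB

# KV cache for the 32k context plus CUDA-graph/activation overhead pushes the
# total into the 120-140 GB range, which is why 2x 48 GB (96 GB) falls short
# while 2x A100 80 GB (160 GB) fits comfortably.
kv_cache_and_overhead_gb = 30  # rough assumption
print(f"Rough total: ~{weights_gb + kv_cache_and_overhead_gb:.0f} GB")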

Thank you @SoheylM, that is what I am currently running as an alternative. I just wanted to be able to get the full version running.
