Running on multiple GPUs
I have 4 GPUs.
I know that we can run the model on multiple GPUs using device_map="auto", but how do I load the input tokens onto multiple GPUs as well?
This way we can only load them onto one GPU:
inputs = inputs.to("cuda")  # inputs will be on cuda:0
I want to load them on all GPUs.
Example:
cuda: 0,1,2,3
Thanks for the issue! It is unclear to me what the motivation behind this is. When you load the model across multiple GPUs through device_map="auto", instead of having one replica of the model on each GPU, your model is sharded across all of them: e.g. the first layer is loaded on GPU 0, the second on GPU 1, and so on. To perform text generation with such a model you need to make sure your input is on the same device as the first layers of the model, hence inputs = inputs.to("cuda"), which places it on cuda:0. The computation is then done sequentially, meaning that while one GPU is being used, all the other GPUs are kept idle.
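A minimal sketch of this sharded setup (assuming 4 GPUs; the prompt is just a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" shards the layers across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
print(model.hf_device_map)  # shows which module ended up on which GPU

# The first layers sit on cuda:0, so the inputs go there too
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))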
If you want to parallelize the text generation procedure by, say, loading one copy of the model per GPU, you can pass device_map={"": PartialState().process_index} (after importing PartialState from accelerate). That way the model is entirely loaded on the device PartialState().process_index, which corresponds to the index of the current GPU. After that you just need to move your input to that device:
device_index = PartialState().process_index
inputs = inputs.to(device_index)
However, I doubt Mixtral will fit on a single GPU unless you use the 2-bit version of the model, e.g. https://huggingface.co./BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch from @BlackSamorez (you need to pip install aqlm and install transformers from source until the next release).
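Putting it together, a minimal sketch of the one-copy-per-GPU approach, launched with accelerate launch --num_processes 4 script.py (assuming a model small enough to fit on a single GPU; the prompts are placeholders):

from accelerate import PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

state = PartialState()
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # or any model that fits on one GPU

tokenizer = AutoTokenizer.from_pretrained(model_id)
# {"": index} loads the entire model on the GPU owned by this process
model = AutoModelForCausalLM.from_pretrained(model_id, device_map={"": state.process_index})

prompts = ["Prompt A", "Prompt B", "Prompt C", "Prompt D"]
# Each process generates for its own slice of the prompts
with state.split_between_processes(prompts) as my_prompts:
    for prompt in my_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(state.device)
        outputs = model.generate(**inputs, max_new_tokens=20)
        print(state.process_index, tokenizer.decode(outputs[0], skip_special_tokens=True))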
Hi @ybelkada, apologies for the delayed response.
I will rephrase the question; my assumption about GPU allocation in the context above was wrong. Apologies for the confusion.
Updated question:
I have 4 GPUs, each with ~40 GB:
GPU 0: 40 GB
GPU 1: 40 GB
GPU 2: 40 GB
GPU 3: 40 GB
As per the Hugging Face guide on loading large models here: https://huggingface.co./docs/accelerate/en/concept_guides/big_model_inference
"balanced_low_0" evenly splits the model on all GPUs except the first one, and only puts on GPU 0 what does not fit on the others. This option is great when you need to use GPU 0 for some processing of the outputs, like when using the generate function for Transformers models
Using balanced_low_0 for text generation, I loaded Mixtral onto all GPUs except GPU 0 (i.e. GPU 0 is saved for the model.generate() function):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="balanced_low_0")
GPU Utilization:
Here we can see that all of GPU 0 and part of GPU 3 still have free memory.
Now, when I feed a long text input (under 32k tokens) to model.generate, a CUDA out-of-memory error appears, but the error only says there is not enough space on GPU 0.
inputs = tokenizer(long_text_message, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)  # CUDA out-of-memory error is raised here
CUDA Error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 22.80 GiB (GPU 0; 39.59 GiB total capacity; 28.81 GiB already allocated; 6.46 GiB free; 31.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
My expectation is to utilize all of GPU 0 and the remaining memory on GPU 3 for model.generate to process long texts.
Could you please let me know how I can utilize GPU 3 on top of GPU 0 for model.generate()? (Or utilize all the remaining GPU resources for model.generate()?)
Hi @kmukeshreddy
Hmm, interesting, I see. My guess here is that the text you're passing is so large that the hidden states computed on the first GPU exceed 40 GB. Can you try again after reducing the size of the model by loading it in half-precision or 8-bit / 4-bit precision?
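For reference, a minimal sketch of the 4-bit option, assuming bitsandbytes is installed (the half-precision variant would instead pass torch_dtype=torch.float16):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantized weights are still dispatched by device_map, keeping GPU 0 mostly free
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="balanced_low_0",
)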
Hi @ybelkada, thank you for the follow-up comment.
The issue with the quantized version is that there is a significant performance drop on my task, so I was looking for a workaround to use the model as-is.