Is it possible to run this model on a V100?
#77 opened by datenbergwerk
I've been trying for a few days now to get this model to run on a V100 GPU.
The problems I've run into so far:
- The original weights are saved in BF16 (bfloat16) format, which the V100 cannot process natively -> I chose a quantized model from this HF repo to circumvent the data-type issue.
- I am now trying to run this model on vLLM, but it requires FlashAttention-2, which itself requires at least an A100 (Ampere or newer) GPU.
- I've also tried quantizing the model on the fly with BitsAndBytes, but I get errors saying that "torch.float16" cannot be imported; roughly what I'm passing is sketched after this list.
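For reference, a minimal sketch of the BitsAndBytes route, with everything forced to torch.float16 since the V100 has no bfloat16 support (the model ID is a placeholder; one guess is that the import error comes from passing the dtype as the string "torch.float16" instead of the torch.float16 object or "float16"):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "path/to/model"  # placeholder, not the actual repo ID

# Force fp16 compute everywhere so nothing silently falls back to bfloat16,
# which the V100 (compute capability 7.0) does not support.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # pass the dtype object, not "torch.float16"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
```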
Has anyone tried (and succeeded at) running this model on V100 GPUs? I have not tried Ollama yet; it seems to be my last resort. Note, I don't have issues with VRAM or anything; it's just that the mistral_inference package chokes on the original dtype (BF16), transformers with the BnB quant gives torch errors, and vLLM needs dependencies that require a newer GPU generation.
Grateful for any hints, help, or alternative routes to getting this to run locally on old hardware.
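For completeness, vLLM also seems to let you override the attention backend via the VLLM_ATTENTION_BACKEND environment variable, which might sidestep the FlashAttention-2 requirement; a rough sketch, untested on Volta, with a placeholder model ID:

```python
import os

# Assumption: forcing the xFormers backend avoids the FlashAttention-2
# requirement; must be set before vllm is imported. Untested on a V100.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

# Placeholder model ID; request float16 explicitly since Volta has no bfloat16.
llm = LLM(model="path/to/quantized-model", dtype="float16")

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```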
I have this running on a 32 GB SXM2 V100 using oobabooga; the Transformers loader works with the full weights.
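Outside of oobabooga, the equivalent plain Transformers load would look roughly like this (model ID is a placeholder; weights are cast to float16 on load since the V100 has no native bfloat16):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/model"  # placeholder, not the actual repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the full weights but cast to fp16 on load: V100 (sm_70) has no bfloat16,
# and the default SDPA/eager attention avoids the FlashAttention-2 requirement.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```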