Is it possible to run this model on a V100?

#77
by datenbergwerk - opened

I've been trying for a few days now to get this model to run on a V100 GPU.
My problems so far have been:

  • The original weights are saved in BF16 (bfloat16) format, which the V100 cannot process -> I chose a quantized model from this HF repo to circumvent the data-type issue.
  • I am now trying to run this model on vLLM, but it requires flash-attention-2, which itself needs at least an Ampere-generation GPU (e.g. an A100).
  • I've also tried quantizing the model on the fly with BitsAndBytes, but I get errors saying that "torch.float16" cannot be imported (see the sketch after this list).

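For reference, this is roughly what I'm attempting on the BitsAndBytes route. It's an untested sketch: the repo id is a placeholder, it assumes recent transformers/bitsandbytes versions, and the compute dtype is passed as a torch dtype object rather than the string "torch.float16":

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "path/or/repo-id-of-this-model"  # placeholder: substitute the actual repo id

# 4-bit on-the-fly quantization with an fp16 compute dtype;
# the V100 has no bf16 support, so everything stays in float16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # dtype object, not the string "torch.float16"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,  # cast any non-quantized layers to fp16 as well
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```
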
Has anyone tried (and succeeded in) running this model on V100 GPUs? I haven't tried Ollama yet; it seems to be my last resort. Note that I don't have issues with VRAM or anything; it's just that the mistral_inference package chokes on the original dtype (BF16), transformers with a BnB quant gives torch errors, and vLLM needs dependencies that require a newer GPU generation.
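On the vLLM side, one route I haven't verified yet is forcing fp16 and the xformers attention backend so flash-attention-2 isn't pulled in. An untested sketch, assuming a vLLM build that still supports Volta (sm70), with the repo id again as a placeholder:

```python
import os
# Pick the xformers attention backend so flash-attn-2 (Ampere-only) is not required.
os.environ.setdefault("VLLM_ATTENTION_BACKEND", "XFORMERS")

from vllm import LLM, SamplingParams

llm = LLM(
    model="path/or/repo-id-of-this-model",  # placeholder for the actual repo id
    dtype="half",                           # fp16 instead of the checkpoint's bf16
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```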

Grateful for any hints, help, or alternative routes to getting this to run locally on old hardware.

I have this running on a 32 GB SXM2 V100 using oobabooga; the Transformers loader works with full weights.
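For anyone trying to reproduce that outside oobabooga: the Transformers-loader route with full weights roughly amounts to casting the BF16 checkpoint to fp16 at load time. An untested sketch with a placeholder repo id:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/or/repo-id-of-this-model"  # placeholder: substitute the actual repo id

# Load the full checkpoint but cast the bf16 weights to fp16 so the V100 can run them.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # bf16 -> fp16 cast at load time
    device_map="auto",
)
```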
