Inference VRAM Size
Hello,
Thank you for such a tremendous contribution! I have tried running inference on my RTX 4090 (24 GB VRAM) to no avail, so I used TheBloke's GGML and GPTQ versions, which work but are very slow. That is in direct contrast to your starchat playground, which is lightning fast...
I would like to try inference with this repo's (native) weights on a GPU to get somewhere in the ballpark of the playground's speed, but how many GB do I need? Do I need to rent something like an A100 80GB?
Ditto. I have the same question.
I'm running it on an A100 80GB and most of the time it's using 30GB of VRAM, peaking at 48GB.
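That is roughly what you would expect for a ~15.5B-parameter model in fp16 (15.5B × 2 bytes ≈ 31 GB of weights, plus activations and KV cache on top). If anyone wants to check the peak on their own hardware, here is a minimal sketch using PyTorch's memory stats; the prompt and generation settings are just placeholders:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/starchat-beta",
    torch_dtype=torch.float16,  # load weights in half precision
    device_map="auto",
)

# Reset the peak counter, run one generation, then read the high-water mark
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")

Note this reports the current device only, so on a multi-GPU device_map="auto" setup the total is spread across cards.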
@valdanito thank you
If you want to save money, you can load it in 4-bit mode; then you only need about 10 GB of GPU RAM.
More info: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA
from transformers import AutoTokenizer, AutoModelForCausalLM

# 4-bit loading requires bitsandbytes and accelerate to be installed
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/starchat-beta",
    load_in_4bit=True,
    device_map="auto",
)
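Once it's loaded, generation works as usual. A rough usage sketch, reusing the tokenizer and model from above; the dialogue template is my recollection of the starchat-beta format, so double-check it against the model card:

# starchat-beta expects <|system|>/<|user|>/<|assistant|> turns ended by <|end|>
prompt = "<|system|>\n<|end|>\n<|user|>\nWrite a Python function that reverses a string.<|end|>\n<|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|end|>"),  # stop at end-of-turn
)
print(tokenizer.decode(outputs[0]))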
@Maxrubino what versions of the related quantization dependencies are you running? I get this exception on the from_pretrained call:
TypeError: GPTBigCodeForCausalLM.__init__() got an unexpected keyword argument 'load_in_4bit'
transformers==4.30.2
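For reference, it may be worth confirming that bitsandbytes and accelerate are installed in the same environment. On recent transformers the 4-bit settings can also be passed explicitly through a BitsAndBytesConfig instead of the bare load_in_4bit kwarg; a minimal sketch, assuming bitsandbytes >= 0.39 and accelerate are available:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Explicit 4-bit quantization settings
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/starchat-beta",
    quantization_config=quant_config,
    device_map="auto",
)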