Start on CPU with vLLM

#1
by kuliev-vitaly - opened

How do I start the model on CPU with Docker?
Is it possible to run the model on GPU and offload most of the layers to RAM?
I have a server with an EPYC CPU, 512 GB of RAM, and 4x RTX 3090.

Open Platform for Enterprise AI org

1. The main branch is in standard AWQ format. I'd guess vLLM can run the model via its CPU backend (e.g., IPEX).
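If that's the case, the usual route with Docker is to build vLLM's CPU image (from the Dockerfile.cpu in the vLLM repo) and serve as normal, optionally setting `VLLM_CPU_KVCACHE_SPACE` to size the KV cache in RAM. Below is a minimal Python sketch of the equivalent entry point, assuming a CPU-only vLLM build and a placeholder model id; whether the AWQ path actually works on the CPU backend is only a guess here:

```python
# Minimal sketch, assuming a CPU-only vLLM build (e.g. an image built from
# Dockerfile.cpu in the vLLM repo). The model id below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/model-awq",  # placeholder AWQ checkpoint, not this repo's id
    quantization="awq",     # tell vLLM the weights are AWQ-quantized
    dtype="bfloat16",       # CPU backend generally wants bf16/fp32 activations
)

params = SamplingParams(temperature=0.7, max_tokens=64)
print(llm.generate(["Hello, world"], params)[0].outputs[0].text)
```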

2. I'm not very familiar with vLLM, but to my knowledge, transformers does not support GPU inference with CPU offload for INT4 models. However, adding hardcoded support for specific models shouldn't be too difficult; it just requires some code modifications.
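For reference, this is what the standard transformers/accelerate offload pattern looks like for full-precision checkpoints; per the note above, it is not expected to work as-is for INT4/AWQ weights, and the model id and memory budgets are illustrative only:

```python
# Sketch of the standard transformers/accelerate offload pattern. As noted
# above, this is NOT expected to work out of the box for INT4/AWQ checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/model-name"  # placeholder

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # accelerate splits layers across the 4 GPUs, then CPU RAM
    max_memory={0: "20GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB", "cpu": "400GiB"},
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```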
