Start on CPU with vLLM
#1
by
kuliev-vitaly
- opened
How can I start the model on CPU with Docker?
Is it possible to run the model on GPU and offload most of the layers to RAM?
I have a server with an EPYC CPU, 512 GB of RAM, and 4x RTX 3090.
1 The main branch is in standard AWQ format. I guess vLLM should be able to run the model via its CPU backend (e.g. IPEX).
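A rough sketch of what running vLLM's CPU backend in Docker might look like, based on vLLM's documented CPU workflow. The image name, model id, and memory values below are placeholders and assumptions, not from this thread; check the vLLM docs for your version, since the CPU image usually has to be built from `Dockerfile.cpu` rather than pulled.

```shell
# Build the CPU image from vLLM's source tree (assumption: repo checked out locally).
docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .

# Serve the model on CPU. VLLM_CPU_KVCACHE_SPACE is the KV-cache budget in GiB;
# 40 GiB is an arbitrary placeholder for a 512 GB machine.
docker run -it --rm --network=host \
    --env VLLM_CPU_KVCACHE_SPACE=40 \
    vllm-cpu-env --model path/to/awq-model   # placeholder model id
```

Note that full GPU-plus-RAM layer offload is a separate question; the CPU backend runs everything on the CPU.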
2 I’m not very familiar with vLLM, but to my knowledge, transformers does not support this for INT4 models. However, adding hardcoded support for specific models shouldn’t be too difficult; it just requires some code modifications.
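For reference, this is roughly what GPU-plus-RAM offload looks like in transformers via accelerate's `device_map` for unquantized models; per the reply above, it may not work for INT4/AWQ checkpoints without code changes. The checkpoint id and memory budgets are placeholders matching the hardware in the question, and the actual load call is commented out since it would download weights.

```python
# Sketch (assumption, not confirmed in this thread): layer offload with
# transformers + accelerate. Keys 0..3 are the four RTX 3090s, "cpu" is RAM.
max_memory = {i: "20GiB" for i in range(4)}  # leave headroom on each 24 GB card
max_memory["cpu"] = "400GiB"                 # spill remaining layers to system RAM

# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "path/to/awq-model",   # placeholder checkpoint id
#     device_map="auto",     # accelerate fills the GPUs first, then the CPU
#     max_memory=max_memory,
# )
print(max_memory["cpu"])
```

With `device_map="auto"`, accelerate places layers on the GPUs up to each budget and offloads the rest to CPU RAM; the catch raised above is that quantized INT4 layers may not have a CPU execution path.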