Start on CPU with vLLM
#1
by
kuliev-vitaly
- opened
How can I start the model on CPU with Docker?
Is it possible to run the model on GPU and offload most of the layers to RAM?
I have a server with an EPYC CPU, 512 GB of RAM, and 4x RTX 3090.
1 The main branch is in standard AWQ format. I guess vLLM should be able to run the model via its CPU backend (e.g. IPEX).
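A rough sketch of what running vLLM's CPU backend in Docker might look like, based on vLLM's documented CPU workflow. The image name, model id, and memory values below are placeholders and assumptions, not from this thread; check the vLLM docs for your version, since the CPU image usually has to be built from `Dockerfile.cpu` rather than pulled.

```shell
# Build the CPU image from vLLM's source tree (assumption: repo checked out locally).
docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .

# Serve the model on CPU. VLLM_CPU_KVCACHE_SPACE is the KV-cache budget in GiB;
# 40 GiB is an arbitrary placeholder for a 512 GB machine.
docker run -it --rm --network=host \
    --env VLLM_CPU_KVCACHE_SPACE=40 \
    vllm-cpu-env --model path/to/awq-model   # placeholder model id
```

Note that full GPU-plus-RAM layer offload is a separate question; the CPU backend runs everything on the CPU.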
2 I’m not very familiar with vLLM, but to my knowledge, transformers does not support this for INT4 models. However, adding hardcoded support for specific models shouldn’t be too difficult; it just requires some code modifications.
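For reference, this is roughly what GPU-plus-RAM offload looks like in transformers via accelerate's `device_map` for unquantized models; per the reply above, it may not work for INT4/AWQ checkpoints without code changes. The checkpoint id and memory budgets are placeholders matching the hardware in the question, and the actual load call is commented out since it would download weights.

```python
# Sketch (assumption, not confirmed in this thread): layer offload with
# transformers + accelerate. Keys 0..3 are the four RTX 3090s, "cpu" is RAM.
max_memory = {i: "20GiB" for i in range(4)}  # leave headroom on each 24 GB card
max_memory["cpu"] = "400GiB"                 # spill remaining layers to system RAM

# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "path/to/awq-model",   # placeholder checkpoint id
#     device_map="auto",     # accelerate fills the GPUs first, then the CPU
#     max_memory=max_memory,
# )
print(max_memory["cpu"])
```

With `device_map="auto"`, accelerate places layers on the GPUs up to each budget and offloads the rest to CPU RAM; the catch raised above is that quantized INT4 layers may not have a CPU execution path.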