Inference speed is extremely slow with FastChat

#22 opened by oximi123

I use FastChat to deploy CodeLlama-7b-Instruct-hf on an A800-80GB server. The inference speed is extremely slow: it runs for more than ten minutes without producing a response to a single request. Any suggestions on how to solve this problem?

Here is how I deploy it with FastChat:

python -m fastchat.serve.controller
python -m fastchat.serve.model_worker --model-path /home/user/botao/CodeLlama-7b-Instruct-hf
python -m fastchat.serve.openai_api_server --host localhost --port 8000
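
For reference, this is a minimal way to time a single request against the OpenAI-compatible endpoint (assuming the model is served under its directory name, CodeLlama-7b-Instruct-hf):

# Time one chat completion against the FastChat OpenAI-compatible API server
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "CodeLlama-7b-Instruct-hf", "messages": [{"role": "user", "content": "Write a hello world program in Python."}], "max_tokens": 128}'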

Did you try the vLLM endpoint?
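
Something like this (a sketch, assuming vllm is installed in the environment; the controller and API server commands stay the same):

# Replace the default model_worker with FastChat's vLLM worker
python -m fastchat.serve.vllm_worker --model-path /home/user/botao/CodeLlama-7b-Instruct-hf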
