vLLM Integration
You can use vLLM as an optimized worker implementation in FastChat. It offers advanced continuous batching and much higher throughput (roughly 10x). See the vLLM documentation for the list of supported models.
Instructions
Install vLLM.
pip install vllm
When you launch a model worker, replace the normal worker (fastchat.serve.model_worker) with the vLLM worker (fastchat.serve.vllm_worker). All other commands, such as the controller, Gradio web server, and OpenAI API server, stay the same.
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3
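If you launch workers from a script, the swap amounts to changing one module name. The following is a minimal sketch, assuming you start workers with `python -m`. The helper `worker_command` is hypothetical and not part of FastChat; only the two module names and the `--model-path` flag come from the instructions above.

```python
import shlex
import sys

# Module names taken from the instructions above.
DEFAULT_WORKER = "fastchat.serve.model_worker"
VLLM_WORKER = "fastchat.serve.vllm_worker"


def worker_command(model_path, use_vllm=True):
    """Build the argv list for launching a FastChat model worker.

    Hypothetical helper for illustration; it only assembles the command
    and does not start any process.
    """
    module = VLLM_WORKER if use_vllm else DEFAULT_WORKER
    return [sys.executable, "-m", module, "--model-path", model_path]


if __name__ == "__main__":
    # Print both variants to show that only the module name differs.
    for use_vllm in (False, True):
        print(shlex.join(worker_command("lmsys/vicuna-7b-v1.3", use_vllm)))
```

To start the worker, pass the returned list to `subprocess.Popen`. The controller and the other servers are launched the same way as before.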