Has anyone achieved a speed-up with this model?


I have tested with speculation enabled:

```bash
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 \
  --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-R1-NextN \
  --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 \
  --trust-remote-code --tp 8 --disable-radix --mem-fraction-static 0.8 \
  --max-running-requests 64 --host 0.0.0.0 --port 8000
```

and then without speculation:

```bash
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 \
  --trust-remote-code --tp 8 --disable-radix --mem-fraction-static 0.8 \
  --max-running-requests 64 --host 0.0.0.0 --port 8000
```
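For reference, a quick sanity check that either server is up and responding can be done against its OpenAI-compatible endpoint (a minimal sketch, assuming the `openai` Python client and the same RunPod proxy URL used for the benchmark below):

```python
# Minimal sanity check against the SGLang OpenAI-compatible endpoint.
# Assumes the `openai` Python package and the RunPod proxy URL below.
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="https://yrgtw0xi6ehsy4-8000.proxy.runpod.net/v1",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
print(resp.usage)  # prompt/completion token counts reported by the server
```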

and I am testing using llmperf:

```bash
export OPENAI_API_KEY="EMPTY"
export OPENAI_API_BASE="https://yrgtw0xi6ehsy4-8000.proxy.runpod.net/v1"

uv run python token_benchmark_ray.py \
  --model "deepseek-ai/DeepSeek-R1" \
  --mean-input-tokens 1000 \
  --stddev-input-tokens 50 \
  --mean-output-tokens 100 \
  --stddev-output-tokens 50 \
  --max-num-completed-requests 25 \
  --timeout 600 \
  --num-concurrent-requests 1 \
  --results-dir "result_outputs" \
  --llm-api openai \
  --additional-sampling-params '{}'
```
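As a rough cross-check on the llmperf numbers, a single request can also be timed directly against the endpoint (a sketch assuming the `openai` Python client; this includes prefill time, so it only approximates decode throughput):

```python
# Rough single-request throughput cross-check (a sketch, not a replacement
# for llmperf): time one completion and divide completion tokens by wall time.
import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="https://yrgtw0xi6ehsy4-8000.proxy.runpod.net/v1",
)

start = time.perf_counter()
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Explain speculative decoding in a few sentences."}],
    max_tokens=200,
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.2f}s -> {out_tokens / elapsed:.1f} tok/s")
```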

The results I get are very similar in both cases:

| Concurrency | Latency (s) | Tokens/s |
|---|---|---|
| 1 (no spec) | 1.5 | 24 |
| 1 (with spec) | 1.5 | 26 |
