Can this model be run with vLLM?
#3 · opened by just1nseo
Thanks for the great work!
Yes. First, install the packages:

pip install openai vllm
Then launch the vLLM server (you may want to tune these arguments for your hardware and memory budget):

vllm serve "open-thoughts/OpenThinker-32B" --tensor-parallel-size=1 --disable-log-requests --enable-chunked-prefill --enable-prefix-caching --max-num-batched-tokens=16192 --max-model-len=8096 --gpu-memory-utilization=0.93
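If you'd rather not run a server at all, vLLM's offline Python API can load the model directly in-process. This is a minimal sketch, assuming a vLLM release recent enough to provide LLM.chat; the sampling values are illustrative, not recommendations:

```python
# Offline inference with vLLM's Python API; mirrors the serve arguments above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="open-thoughts/OpenThinker-32B",
    tensor_parallel_size=1,
    max_model_len=8096,
    gpu_memory_utilization=0.93,
)
params = SamplingParams(temperature=0.7, max_tokens=1024)  # illustrative values
outputs = llm.chat(
    [{"role": "user", "content": "What is 1+1?"}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```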
Once the server is up, you can query it from Python as follows:
import openai

# Point the client at the local vLLM server's OpenAI-compatible endpoint
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # any string works unless vllm serve was given --api-key
)

completion = client.chat.completions.create(
    model="open-thoughts/OpenThinker-32B",
    messages=[{"role": "user", "content": "What is 1+1?"}],
)
print(completion.choices[0].message.content)
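Because OpenThinker-32B tends to produce a long reasoning trace before its final answer, you may prefer to stream tokens as they arrive. The same OpenAI client supports this via stream=True; a short sketch:

```python
# Stream the response token-by-token instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="open-thoughts/OpenThinker-32B",
    messages=[{"role": "user", "content": "What is 1+1?"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; content can be None on some chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```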