How to use the bnb-4bit model?
Are there detailed examples and tutorials available? Thanks!
Here's a rundown of how I got it working locally.
Running Unsloth QwQ-32B with Dynamic Quantization (vLLM)
Prerequisites
- Hardware: At least 2x NVIDIA GPUs with 24GB VRAM each (total 48GB).
- Software: Linux OS, Python 3.9-3.12, NVIDIA CUDA drivers.
Step 1: Install Dependencies
Using uv (Fast Python Env Manager)
- What is uv? A lightweight tool for managing Python environments. Install instructions here.
- Create a virtual environment:
uv venv vllm-env --python 3.12 --seed
source vllm-env/bin/activate
Install vLLM & BitsAndBytes
uv pip install vllm "bitsandbytes>=0.45.0"
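To confirm the environment is set up correctly before starting the server, a quick check like this can help (a minimal sketch; assumes you are inside the vllm-env environment):
# check_install.py - sanity check that vLLM and a new-enough bitsandbytes are importable (sketch)
import vllm
import bitsandbytes as bnb

print("vllm:", vllm.__version__)
print("bitsandbytes:", bnb.__version__)  # should report >= 0.45.0 per the install step above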
Step 2: Run OpenAI-Compatible API Server
Use this command to start the server with dynamic quantization and GPU parallelism:
python -m vllm.entrypoints.openai.api_server \
--model unsloth/QwQ-32B-unsloth-bnb-4bit \
--quantization bitsandbytes \
--load-format bitsandbytes \
--tensor-parallel-size 2 \
--max-model-len 4096
Key Parameters Explained:
- --quantization bitsandbytes: Enables 4-bit quantization to reduce VRAM usage.
- --load-format bitsandbytes: Specifies the weight-loading format for the bnb checkpoint.
- --tensor-parallel-size 2: Distributes the model across your 2 GPUs.
- --max-model-len 4096: Caps the context length so the KV cache fits in VRAM (without it, vLLM defaults to the model's full context window).
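If you'd rather call the model from Python than run the HTTP server, the same flags map onto vLLM's offline LLM class. A minimal sketch (same model and settings as the command above; the prompt and sampling values are just examples):
# offline_qwq.py - same quantization settings as the server command, via vLLM's Python API (sketch)
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    quantization="bitsandbytes",   # 4-bit quantization
    load_format="bitsandbytes",    # load the bnb-format weights
    tensor_parallel_size=2,        # shard across both GPUs
    max_model_len=4096,            # cap the context length
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)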
Troubleshooting Tips
Out of Memory (OOM) Errors:
- Ensure you're using both GPUs with --tensor-parallel-size 2.
- Verify your GPUs have ≥24GB VRAM each (48GB total).
- Reduce --max-model-len if issues persist (a shorter context means a smaller KV cache, though going too low will cut off QwQ-32B's longer reasoning outputs).
Distributed Inference Notes:
- For multi-GPU setups, vLLM handles the tensor-parallel sharding once --tensor-parallel-size is set.
- If using >2 GPUs or multiple nodes, adjust --tensor-parallel-size and follow vLLM's distributed serving docs.
Check GPU Usage:
nvidia-smi # Ensure GPUs are recognized and not in use.
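If you prefer checking from Python (a small sketch using PyTorch, which vLLM pulls in as a dependency):
# gpu_check.py - confirm both GPUs are visible and report free memory (sketch)
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, {free / 1e9:.1f} / {total / 1e9:.1f} GB free")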
Quick Usage Example
Once the server is running (default URL: http://localhost:8000), test it with curl:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/QwQ-32B-unsloth-bnb-4bit",
"prompt": "Explain quantum computing in simple terms.",
"max_tokens": 100
}'
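Because the server is OpenAI-compatible, the same request also works with the official openai Python client (a sketch; the api_key value is a placeholder, since the server only checks it if you start it with --api-key):
# client_example.py - query the vLLM server with the OpenAI Python client (sketch)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    prompt="Explain quantum computing in simple terms.",
    max_tokens=100,
)
print(response.choices[0].text)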
Resources
- Unsloth Tutorial: How to Run QwQ-32B Effectively
- vLLM Docs: Quantization Guide | Distributed Serving
Really appreciate it, thank you very much!
Is it possible to run it locally using unsloth's FastLanguageModel?
I tried hard, but the decoded tokens get stuck in an infinite loop. Do I need to define a LogitsProcessor myself?
Any advice or tutorials would be appreciated. Thanks!