How to use the bnb-4bit model?
Are there detailed examples and tutorials available? Thanks!
Here's a rundown of how I got it working locally.
Running Unsloth QwQ-32B with Dynamic Quantization (vLLM)
Prerequisites
- Hardware: At least 2x NVIDIA GPUs with 24GB VRAM each (total 48GB).
- Software: Linux OS, Python 3.9-3.12, NVIDIA CUDA drivers.
Step 1: Install Dependencies
Using uv (Fast Python Env Manager)
- What is uv? A lightweight tool for managing Python environments. Install instructions here.
- Create a virtual environment:
uv venv vllm-env --python 3.12 --seed
source vllm-env/bin/activate
Install vLLM & BitsAndBytes
uv pip install vllm "bitsandbytes>=0.45.0"
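To confirm the environment is set up correctly before starting the server, a quick check like this can help (a minimal sketch; assumes you are inside the vllm-env environment):
# check_install.py - sanity check that vLLM and a new-enough bitsandbytes are importable (sketch)
import vllm
import bitsandbytes as bnb

print("vllm:", vllm.__version__)
print("bitsandbytes:", bnb.__version__)  # should report >= 0.45.0 per the install step above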
Step 2: Run OpenAI-Compatible API Server
Use this command to start the server with dynamic quantization and GPU parallelism:
python -m vllm.entrypoints.openai.api_server \
--model unsloth/QwQ-32B-unsloth-bnb-4bit \
--quantization bitsandbytes \
--load-format bitsandbytes \
--tensor-parallel-size 2 \
--max-model-len 4096
Key Parameters Explained:
- --quantization bitsandbytes: Enables 4-bit quantization to reduce VRAM usage.
- --load-format bitsandbytes: Specifies the weight-loading format for the bnb checkpoint.
- --tensor-parallel-size 2: Distributes the model across your 2 GPUs.
- --max-model-len 4096: Caps the context length so the KV cache fits in VRAM (without it, vLLM defaults to the model's full context window).
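If you'd rather call the model from Python than run the HTTP server, the same flags map onto vLLM's offline LLM class. A minimal sketch (same model and settings as the command above; the prompt and sampling values are just examples):
# offline_qwq.py - same quantization settings as the server command, via vLLM's Python API (sketch)
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    quantization="bitsandbytes",   # 4-bit quantization
    load_format="bitsandbytes",    # load the bnb-format weights
    tensor_parallel_size=2,        # shard across both GPUs
    max_model_len=4096,            # cap the context length
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)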
Troubleshooting Tips
Out of Memory (OOM) Errors:
- Ensure you're using both GPUs with --tensor-parallel-size 2.
- Verify your GPUs have ≥24GB VRAM each (48GB total).
- Reduce --max-model-len if issues persist (a shorter context means a smaller KV cache, though going too low will cut off QwQ-32B's longer reasoning outputs).
Distributed Inference Notes:
- For multi-GPU setups, vLLM handles the tensor-parallel sharding once --tensor-parallel-size is set.
- If using >2 GPUs or multiple nodes, adjust --tensor-parallel-size and follow vLLM's distributed serving docs.
Check GPU Usage:
nvidia-smi # Ensure GPUs are recognized and not in use.
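If you prefer checking from Python (a small sketch using PyTorch, which vLLM pulls in as a dependency):
# gpu_check.py - confirm both GPUs are visible and report free memory (sketch)
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, {free / 1e9:.1f} / {total / 1e9:.1f} GB free")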
Quick Usage Example
Once the server is running (default URL: http://localhost:8000), test it with curl:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/QwQ-32B-unsloth-bnb-4bit",
"prompt": "Explain quantum computing in simple terms.",
"max_tokens": 100
}'
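Because the server is OpenAI-compatible, the same request also works with the official openai Python client (a sketch; the api_key value is a placeholder, since the server only checks it if you start it with --api-key):
# client_example.py - query the vLLM server with the OpenAI Python client (sketch)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    prompt="Explain quantum computing in simple terms.",
    max_tokens=100,
)
print(response.choices[0].text)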
Resources
- Unsloth Tutorial: How to Run QwQ-32B Effectively
- vLLM Docs: Quantization Guide | Distributed Serving
Really appreciate it, thank you very much!
Is it possible to run it locally using unsloth's FastLanguageModel?
I tried hard, but the decoded tokens get stuck in an infinite loop. Do I need to define a LogitsProcessor myself?
Any advice or tutorials would be appreciated. Thanks!