How to use the bnb-4bit model?

Are there detailed examples and tutorials available? Thanks!

Here's the rundown of how I got it working locally for me.

Running Unsloth QwQ-32B with Dynamic Quantization (vLLM)

Prerequisites

  1. Hardware: At least 2x NVIDIA GPUs with 24GB VRAM each (total 48GB).
  2. Software: Linux OS, Python 3.9-3.12, NVIDIA CUDA drivers.

Step 1: Install Dependencies

Using uv (Fast Python Env Manager)

  • What is uv? A fast, lightweight tool for managing Python environments and packages; see the uv installation docs for setup instructions.
  • Create a virtual environment:
    uv venv vllm-env --python 3.12 --seed
    source vllm-env/bin/activate
    

Install vLLM & BitsAndBytes

uv pip install vllm "bitsandbytes>=0.45.0"  # quote the spec so the shell doesn't treat >= as a redirect
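
Optionally, a quick sanity check that both packages import inside the activated environment (a small sketch, nothing vLLM-specific):

# Sanity check: both packages should import and report their versions.
import vllm
import bitsandbytes

print("vllm:", vllm.__version__)
print("bitsandbytes:", bitsandbytes.__version__)  # should be >= 0.45.0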

Step 2: Run OpenAI-Compatible API Server

Use this command to start the server with dynamic quantization and GPU parallelism:

python -m vllm.entrypoints.openai.api_server \
    --model unsloth/QwQ-32B-unsloth-bnb-4bit \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --tensor-parallel-size 2 \
    --max-model-len 4096

Key Parameters Explained:

  • --quantization bitsandbytes: Enables 4-bit quantization to reduce VRAM usage.
  • --load-format bitsandbytes: Specifies the quantization format.
  • --tensor-parallel-size 2: Distributes the model across your 2 GPUs.
  • --max-model-len 4096: Caps the context length at 4096 tokens to keep KV-cache memory in check; QwQ-32B supports a much longer context if you have the VRAM for it.
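
If you'd rather do offline inference from Python instead of running the API server, a rough equivalent using vLLM's LLM class looks like this (an untested sketch; the keyword arguments mirror the CLI flags above, and the sampling settings are just illustrative):

from vllm import LLM, SamplingParams

# Same settings as the server command above, passed as keyword arguments.
llm = LLM(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    tensor_parallel_size=2,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)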

Troubleshooting Tips

  1. Out of Memory (OOM) Errors:

    • Ensure you're using both GPUs with --tensor-parallel-size 2.
    • Verify your GPUs have β‰₯24GB VRAM each (total 48GB).
    • Reduce --max-model-len if issues persist; a shorter context means a smaller KV cache.
  2. Distributed Inference Notes:

    • For multi-GPU setups, vLLM automatically handles tensor parallelism.
    • If using >2 GPUs or multiple nodes, adjust --tensor-parallel-size and follow vLLM's distributed docs.
  3. Check GPU Usage:

    nvidia-smi  # Ensure GPUs are recognized and not in use.
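
If you prefer checking free VRAM from Python, here's a small sketch (it assumes PyTorch, which vLLM already pulls in as a dependency):

import torch

# Print free vs. total memory for every visible GPU before launching the server.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")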
    

Quick Usage Example

Once the server is running (default URL: http://localhost:8000), test it with curl:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/QwQ-32B-unsloth-bnb-4bit",
    "prompt": "Explain quantum computing in simple terms.",
    "max_tokens": 100
  }'
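
If you'd rather call it from Python, here's a minimal sketch using the openai client package (pip install openai) pointed at the local server; the api_key value is just a placeholder, since vLLM only checks it when started with --api-key:

from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="unsloth/QwQ-32B-unsloth-bnb-4bit",
    prompt="Explain quantum computing in simple terms.",
    max_tokens=100,
)
print(resp.choices[0].text)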

Really appreciate it, thank you very much!

Is it possible to run it locally using unsloth's FastLanguageModel?
I tried hard, but the decoded tokens get stuck in an infinite loop. Do I need to define a LogitsProcessor myself?
Any advice or tutorials would be appreciated. Thanks!
