I can't run any of the bnb-4bit quants with TextGenerationInference

#6 opened by v3ss0n

Here are the options I used:

"--quantize bitsandbytes-fp4 --max-input-tokens 30000 --sharded true --num-shard 2"

Docker Compose file:

  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:3.1.0
    environment:
      - HF_TOKEN=<your-hf-token>
      # - MODEL_ID=Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
      # - MODEL_ID=mistralai/Mistral-Small-24B-Instruct-2501
      # - MODEL_ID=Qwen/Qwen2.5-Coder-32B-Instruct-AWQ
      # - MODEL_ID=avoroshilov/DeepSeek-R1-Distill-Qwen-32B-GPTQ_4bit-128g
      # - MODEL_ID=Valdemardi/DeepSeek-R1-Distill-Qwen-32B-AWQ
      # - MODEL_ID=Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
      # - MODEL_ID=unsloth/Qwen2.5-Coder-32B-bnb-4bit
      # - MODEL_ID=unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit
      - MODEL_ID=unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit
      # - SHARDED=true
      # - SHARDS=2
      # - QUANTIZED=bitsandbytes
    ports:
      - "0.0.0.0:8099:80"
    restart: "unless-stopped"
    command: "--quantize bitsandbytes-fp4 --max-input-tokens 30000 --sharded true --num-shard 2"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
    shm_size: '90g'
    volumes:
      - ~/.hf-docker-data:/data
    networks:
      - llmhost

Error:

text-generation-inference-1  | [rank1]: AssertionError: The choosen size 1 is not compatible with sharding on 2 shards rank=1
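For context, this assertion appears to be raised during TGI's sharded weight loading, when a tensor dimension (here, size 1) cannot be split across the requested number of shards. A quick sanity check (a sketch, assuming a single GPU has enough memory for the 8B 4-bit weights) is to drop sharding and run on one GPU:

    command: "--quantize bitsandbytes-fp4 --max-input-tokens 30000"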

I also opened an issue on the TGI side; not sure which side has the problem:

https://github.com/huggingface/text-generation-inference/issues/3005

Unsloth AI org

Thanks, honestly I have never seen this error before - but please note you are using our dynamic quant, which might not be supported. Instead, use the basic BNB version.
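If the naming follows the pattern of the other unsloth repos already listed in the compose file, the basic version would drop the extra `-unsloth` suffix (a sketch; verify the repo ID on the Hub before use):

    environment:
      # basic (non-dynamic) BNB 4-bit quant; note the missing "-unsloth" suffix
      - MODEL_ID=unsloth/DeepSeek-R1-Distill-Llama-8B-bnb-4bit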
