I can't run any of the bnb-4bit quants with TextGenerationInference
#6
by v3ss0n - opened
Here are the options I used:

`--quantize bitsandbytes-fp4 --max-input-tokens 30000 --sharded true --num-shard 2`
Docker Compose file (token redacted):

```yaml
text-generation-inference:
  image: ghcr.io/huggingface/text-generation-inference:3.1.0
  environment:
    - HF_TOKEN=hf_***
    # - MODEL_ID=Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
    # - MODEL_ID=mistralai/Mistral-Small-24B-Instruct-2501
    # - MODEL_ID=Qwen/Qwen2.5-Coder-32B-Instruct-AWQ
    # - MODEL_ID=avoroshilov/DeepSeek-R1-Distill-Qwen-32B-GPTQ_4bit-128g
    # - MODEL_ID=Valdemardi/DeepSeek-R1-Distill-Qwen-32B-AWQ
    # - MODEL_ID=Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
    # - MODEL_ID=unsloth/Qwen2.5-Coder-32B-bnb-4bit
    # - MODEL_ID=unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit
    - MODEL_ID=unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit
    # - SHARDED=true
    # - SHARDS=2
    # - QUANTIZED=bitsandbytes
  ports:
    - "0.0.0.0:8099:80"
  restart: "unless-stopped"
  command: "--quantize bitsandbytes-fp4 --max-input-tokens 30000 --sharded true --num-shard 2"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0', '1']
            capabilities: [gpu]
  shm_size: '90g'
  volumes:
    - ~/.hf-docker-data:/data
  networks:
    - llmhost
```
Error:

```
text-generation-inference-1 | [rank1]: AssertionError: The choosen size 1 is not compatible with sharding on 2 shards rank=1
```
The assertion seems to mean TGI tried to split a weight dimension of size 1 across the 2 shards, so tensor parallelism may simply not work with this quant. I also opened an issue on the TGI side; not sure which side has the problem:
https://github.com/huggingface/text-generation-inference/issues/3005
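In case it helps anyone hitting the same assertion, a minimal single-shard variant of the service above sidesteps the weight splitting entirely. This is an untested sketch, assuming a single GPU has enough memory for the 8B bnb-4bit checkpoint:

```yaml
text-generation-inference:
  image: ghcr.io/huggingface/text-generation-inference:3.1.0
  environment:
    - HF_TOKEN=hf_***
    - MODEL_ID=unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit
  # No --sharded / --num-shard 2, so no tensor-parallel weight splitting occurs.
  command: "--quantize bitsandbytes-fp4 --max-input-tokens 30000 --num-shard 1"
  ports:
    - "0.0.0.0:8099:80"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
  shm_size: '90g'
  volumes:
    - ~/.hf-docker-data:/data
```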
Thanks, honestly I have never seen this error before, but please note you are using our dynamic quant, which might not be supported. Instead, use the basic BNB version.
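If "the basic BNB version" means the non-dynamic checkpoint, the fix may be a one-line MODEL_ID swap in the compose file above. This assumes `unsloth/DeepSeek-R1-Distill-Llama-8B-bnb-4bit` is the matching basic repo, since the `-unsloth-` infix marks the dynamic quants:

```yaml
environment:
  # Assumed name of the basic (non-dynamic) BNB 4-bit repo:
  - MODEL_ID=unsloth/DeepSeek-R1-Distill-Llama-8B-bnb-4bit
```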