AttributeError: Model MllamaForConditionalGeneration does not support BitsAndBytes quantization yet.
Hi! I'm trying to run this model on an NVIDIA T4 GPU (16 GB VRAM) with vLLM running in a Docker container, but without success. Even after changing some of the vLLM params and running the latest vLLM version (v0.6.3.post1), I still get the error:
AttributeError: Model MllamaForConditionalGeneration does not support BitsAndBytes quantization yet.
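As a sanity check on the checkpoint itself, I was planning to load it outside vLLM with plain transformers + bitsandbytes. A minimal sketch (assuming transformers >= 4.45, which added Mllama support; the unsloth bnb-4bit repo ships its own quantization_config, so `from_pretrained` should pick up the 4-bit weights on its own):

```python
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit"

# The checkpoint embeds a bitsandbytes quantization_config, so no explicit
# BitsAndBytesConfig should be needed here.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # T4 has no bfloat16 support
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```

I haven't verified this on the T4 yet, but if it loads, the weights themselves are fine and the limitation is on the vLLM side.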
The command I'm running is as follows:
```bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.6.3.post1 \
    --model unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit \
    --dtype half \
    --quantization bitsandbytes \
    --load_format bitsandbytes \
    --max_model_len 50000 \
    --gpu_memory_utilization 0.99 \
    --trust-remote-code \
    --enforce-eager
```
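To rule out the Docker setup, the same error should be reproducible with a few lines of Python against the same vLLM version, something like:

```python
# Minimal repro sketch with vLLM's offline API (v0.6.3.post1), mirroring the
# server flags above; kwargs not in LLM's signature are forwarded to EngineArgs.
from vllm import LLM

llm = LLM(
    model="unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit",
    dtype="half",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    max_model_len=50000,
    gpu_memory_utilization=0.99,
    trust_remote_code=True,
    enforce_eager=True,
)
```

If this raises the same AttributeError, it would confirm that bitsandbytes support is gated per model class inside vLLM and the container flags aren't the cause.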
Is there something I'm missing?
Thanks in advance. Cheers!