quantization gptq_marlin not working ("quantization method not found"); removing it works

#7
by linpan - opened

env: vllm 0.5.3.post

docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 \
  --quantization gptq_marlin \
  --tensor-parallel-size 8 \
  --max-model-len 4096

Hugging Quants org

Hi here @linpan, could you please add the error or elaborate more on why it fails? Thanks!

With --quantization gptq_marlin, the server fails with a "quantization method not found" error.

remove "--quantization gptq_marlin" is working. vllm0.5.3 support gptq_marlin

Hugging Quants org

Well, that's odd, since it should support gptq_marlin as per https://docs.vllm.ai/en/v0.5.3/models/engine_args.html

--quantization, -q
Possible choices: aqlm, awq, deepspeedfp, fp8, fbgemm_fp8, marlin, gptq_marlin_24, gptq_marlin, awq_marlin, gptq, squeezellm, compressed-tensors, bitsandbytes, None

Method used to quantize the weights. If None, we first check the quantization_config attribute in the model config file. If that is None, we assume the model weights are not quantized and use dtype to determine the data type of the weights.

I guess that those will be used by default anyway, as that's more optimal, but it's still weird that gptq_marlin doesn't work. Could you please file an issue at https://github.com/vllm-project/vllm/issues? They will be able to address that better 🤗

Case 1: remove the quantization parameter and ask "who are you?"
Problem: the output does not stop until the model's max_tokens is reached.
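For context, the test was just a chat request against the OpenAI-compatible endpoint started by the command above; a minimal sketch of such a request (the max_tokens value here is illustrative, not the exact setting used):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4",
        "messages": [{"role": "user", "content": "who are you?"}],
        "max_tokens": 256
      }'

If the model never emits its end-of-turn token, a request like this keeps generating until max_tokens is hit.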

[screenshot: CleanShot 2024-07-31 at 09.32.50.png]

[screenshot: CleanShot 2024-07-31 at 09.33.31.png]
The output still does not stop.
With the Meta 405B FP8 model, I get a complete answer in about 3 seconds.

Case 2: add --quantization gptq_marlin.

[screenshot: CleanShot 2024-07-31 at 09.36.41.png]

Your quantized model has a problem.

From this model's config, it is not quantized in marlin format, so that should be the reason.
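For anyone who wants to double-check, the repo's config.json can be inspected directly. A quick sketch, assuming curl and jq are available and that the checkpoint uses the usual GPTQ quantization_config layout:

curl -s https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4/resolve/main/config.json \
  | jq .quantization_config

If there is no marlin-specific marker in there (some GPTQ exports carry a checkpoint_format field set to "marlin"), the checkpoint is a plain GPTQ export.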

Hugging Quants org

If that's the case, then do you mind opening a PR here to replace the gptq_marlin line within the vLLM command with gptq instead? Thanks a lot 🤗
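For reference, that change is just swapping the flag in the original command; a sketch with everything else kept the same:

docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 \
  --quantization gptq \
  --tensor-parallel-size 8 \
  --max-model-len 4096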

If you want marlin, you're probably better off using

https://huggingface.co./neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16

It performed about twice as fast on my setup.
