GGUF, AWQ, GPTQ ?

#1
by 0xbcn - opened

Sorry if you see this question a lot, but what exactly is the difference between those 3 file formats?
Or better asked: what should I use if I have a GPU available for inference and prefer maximum performance over "everything else"?

For now I opted for GGUF and the llama.cpp implementation, assuming C++ performs well in this area.

GGUF is best for CPU and Mac.

GPTQ is best for GPU inference with ExLlama, and good for servers since it can be used with TGI.

AWQ is the highest quality and best for servers, since it can be used with vLLM.

So: most likely GPTQ if you have a GPU that can fit the whole model, GGUF if you have a Mac or CPU only (or cannot fit the model in GPU memory), and AWQ for server inference with vLLM.
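The rule of thumb above can be sketched as a tiny helper. This is purely illustrative; `pick_format` and its parameters are hypothetical names, not part of any library:

```python
def pick_format(has_gpu: bool, fits_in_vram: bool = False,
                server_with_vllm: bool = False) -> str:
    """Rough decision rule from this thread (illustrative only).

    - AWQ for server inference with vLLM
    - GPTQ when a GPU can hold the whole model (e.g. with ExLlama or TGI)
    - GGUF otherwise: CPU, Mac, or model too big for VRAM (via llama.cpp)
    """
    if server_with_vllm:
        return "AWQ"
    if has_gpu and fits_in_vram:
        return "GPTQ"
    return "GGUF"

print(pick_format(has_gpu=True, fits_in_vram=True))      # GPTQ
print(pick_format(has_gpu=False))                        # GGUF
print(pick_format(has_gpu=True, server_with_vllm=True))  # AWQ
```

In practice you would also weigh quantization bit-width and kernel support, but this captures the hardware-first logic of the answer.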
