[Help] How much performance loss is there with Q2 and Q3 quantization?

#3
by TrumpMAGA14061946 - opened

I have an NVIDIA RTX 4060 with 8 GB of VRAM, and my computer has 64 GB of RAM. On Windows 11, the shared-VRAM limit is 0.5 × 64 + 8 = 40 GB, which lets me run some larger models. However, because the 4060's compute power is limited, when I run a 32B model at Q4 quantization the generation speed is less than 1 token/s, so I want to try smaller quantizations. I recall that for larger models, the quality loss from Q2 and Q3 is not as severe as it is for 1.5B or 7B models, and I'd like to know exactly how large the loss is. (I know that for my hardware, Q4 or Q5 quantization of a 7B model is the best fit, but I want to try the latest models.)
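For picking a quant that fits, a back-of-the-envelope file-size estimate helps: multiply the parameter count (in billions) by the average bits per weight and divide by 8. The bits-per-weight figures below are rough averages I'm assuming for llama.cpp K-quants; exact GGUF sizes vary by model architecture and tensor mix, so treat this as a sketch, not a definitive sizing tool.

```python
# Rough GGUF size estimate: params (billions) * bits-per-weight / 8 -> GB.
# Bits-per-weight values are assumed approximate averages for llama.cpp
# K-quants; real file sizes differ somewhat per model.
BPW = {"Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.85, "Q5_K_M": 5.7}

def approx_size_gb(params_billions: float, quant: str) -> float:
    """Approximate on-disk/in-memory model size in GB for a given quant."""
    return params_billions * BPW[quant] / 8

for q, bpw in BPW.items():
    print(f"32B @ {q} (~{bpw} bpw): ~{approx_size_gb(32, q):.1f} GB")
```

By this estimate a 32B model drops from roughly 19 GB at Q4_K_M to around 10 GB at Q2_K, which is why Q2/Q3 looks attractive on an 8 GB card, though the quality trade-off is the open question here.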
