Request to bring back Q4_1
According to my tests, Q4_1 is the most efficient for the quality, using 20% less energy on my computer than Q4_K_M.
I can believe the energy reduction, but how did you compare the quality, and how does it compare to the Q4_0 that I provide? I'd need a compelling reason to replace Q4_0 with Q4_1, and an even more compelling reason to provide both. The idea behind providing Q4_0 was to provide a fast quant for slow computers. I suspect Q4_1 is simply slower but also a bit higher quality. It certainly is a lot bigger with very little quality increase.
In any case, @nicoboss is currently preparing a quite extensive (quality) benchmark over all quant types (including Q4_1). I will reevaluate all quants based on that as well.
@nicoboss while we are at it, another test I'd like to do on all quants is a speed test, to see how different quants work on cpu vs. gpu, possibly both prompt processing and inferencing. I don't think I can put all this on the model page, but I could link to a guidance page, where people could get an idea of relative speeds vs. quality.
While I never performed any performance-specific benchmarks as part of the quant quality measurement project so far, I did perform multiple evaluation benchmarks of each model from which I measured the quant quality. Thanks to the creation and last-modified times of the evaluation result files I was able to extract the following quant performance measurements. Each measurement covers running all the evaluation tests inside the benchmark. All tests were performed with the model stored fully in RAM without offloading any layers to the GPU, but using a GPU to speed up prompt evaluation (-ngl 0). Further keep in mind that these are multiple-choice benchmarks, so the results are skewed more heavily towards prompt evaluation relative to token generation than would be the case in most normal use cases.
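For anyone wanting to reproduce this kind of timing extraction, here is a minimal sketch (not the actual script used) of how per-benchmark runtimes can be recovered from the result files' timestamps; the directory layout and file naming are hypothetical:

```python
# Approximate each benchmark's runtime as (last-modified time - creation time) of its
# result file. On Linux st_ctime is only the inode change time, so treat this as a
# rough proxy unless the filesystem exposes st_birthtime.
from pathlib import Path

def runtime_seconds(result_file: Path) -> int:
    st = result_file.stat()
    created = getattr(st, "st_birthtime", st.st_ctime)  # fall back to ctime on Linux
    return round(st.st_mtime - created)

results_dir = Path("eval-results")  # hypothetical directory with one file per quant/benchmark
for f in sorted(results_dir.glob("*.json")):
    print(f.stem, runtime_seconds(f), "s")
```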
Comparison between Q4_0 and Q4_1
Model | MMLU [s] | ARC easy [s] | ARC challenge [s] | Winogrande [s] |
---|---|---|---|---|
Fook-Yi-34B-32K-v1.Q4_0 | 329 | 245 | 38 | 79 |
Fook-Yi-34B-32K-v1.Q4_1 | 368 | 274 | 42 | 87 |
Fook-Yi-34B-32K-v1.i1-Q4_0 | 331 | 270 | 39 | 80 |
Fook-Yi-34B-32K-v1.i1-Q4_1 | 360 | 301 | 42 | 87 |
Conclusion
The claim that Q4_1 performs better than Q4_0 seems wrong; I actually saw the exact opposite: Q4_0 performed quite a lot better than Q4_1. I monitored the GPU energy consumption during those benchmarks and it was approximately the same no matter the quant, so at least in these tests, longer runtime directly translates into more energy consumed for the same task. From a performance perspective I do not recommend replacing Q4_0 with Q4_1. Keep in mind that I have not considered quality in this comparison.
Future research
I want to come up with some performance tests that are mainly skewed towards token generation instead of prompt evaluation, and want to run them CPU only, CPU with computation offloaded to the GPU (-ngl 0), with layers manually offloaded to the GPU, with GPU memory overflowing into RAM (GGML_CUDA_ENABLE_UNIFIED_MEMORY=1), fully on GPU, and using RPC.
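A rough sketch of how such a sweep could be scripted around llama-bench (the model path is a placeholder; only common llama-bench flags are used here, and the CPU-only and RPC cases would additionally need a CPU-only build and a running rpc-server respectively):

```python
# Sketch of a token-generation-heavy sweep over a few offloading configurations.
# -p 0 -n 256 skips prompt processing so the result is dominated by token generation.
import os
import subprocess

MODEL = "Fook-Yi-34B-32K-v1.Q4_0.gguf"  # placeholder path

configs = [
    ("no offload, GPU-assisted prompt eval", {}, ["-ngl", "0"]),
    ("full offload", {}, ["-ngl", "999"]),
    ("VRAM overflow into RAM", {"GGML_CUDA_ENABLE_UNIFIED_MEMORY": "1"}, ["-ngl", "999"]),
]

for name, extra_env, args in configs:
    cmd = ["./llama-bench", "-m", MODEL, "-p", "0", "-n", "256", "-t", "8", *args]
    print(f"=== {name}: {' '.join(cmd)}")
    subprocess.run(cmd, env={**os.environ, **extra_env}, check=False)
```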
Results of all quants
MMLU performance
Filename | Time (seconds) |
---|---|
Fook-Yi-34B-32K-v1.i1-IQ1_S | 181 |
Fook-Yi-34B-32K-v1.i1-IQ2_XXS | 202 |
Fook-Yi-34B-32K-v1.i1-IQ1_M | 218 |
Fook-Yi-34B-32K-v1.i1-IQ2_XS | 225 |
Fook-Yi-34B-32K-v1.i1-IQ2_S | 233 |
Fook-Yi-34B-32K-v1.i1-IQ2_M | 243 |
Fook-Yi-34B-32K-v1.i1-IQ3_XXS | 256 |
Fook-Yi-34B-32K-v1.i1-Q2_K_S | 261 |
Fook-Yi-34B-32K-v1.i1-Q2_K | 264 |
Fook-Yi-34B-32K-v1.i1-IQ3_XS | 268 |
Fook-Yi-34B-32K-v1.Q2_K | 271 |
Fook-Yi-34B-32K-v1.IQ3_XS | 275 |
Fook-Yi-34B-32K-v1.i1-IQ3_S | 279 |
Fook-Yi-34B-32K-v1.i1-Q3_K_S | 282 |
Fook-Yi-34B-32K-v1.IQ3_S | 286 |
Fook-Yi-34B-32K-v1.i1-IQ3_M | 286 |
Fook-Yi-34B-32K-v1.IQ3_M | 288 |
Fook-Yi-34B-32K-v1.Q3_K_S | 295 |
Fook-Yi-34B-32K-v1.i1-Q3_K_M | 302 |
Fook-Yi-34B-32K-v1.Q3_K_M | 315 |
Fook-Yi-34B-32K-v1.i1-IQ4_XS | 319 |
Fook-Yi-34B-32K-v1.i1-Q3_K_L | 321 |
Fook-Yi-34B-32K-v1.IQ4_XS | 323 |
Fook-Yi-34B-32K-v1.Q3_K_L | 325 |
Fook-Yi-34B-32K-v1.Q4_0 | 329 |
Fook-Yi-34B-32K-v1.i1-Q4_0 | 331 |
Fook-Yi-34B-32K-v1.i1-Q4_K_S | 334 |
Fook-Yi-34B-32K-v1.i1-IQ4_NL | 335 |
Fook-Yi-34B-32K-v1.Q4_K_S | 338 |
Fook-Yi-34B-32K-v1.IQ4_NL | 339 |
Fook-Yi-34B-32K-v1.i1-Q4_K_M | 347 |
Fook-Yi-34B-32K-v1.Q4_K_M | 348 |
Fook-Yi-34B-32K-v1.i1-Q4_1 | 360 |
Fook-Yi-34B-32K-v1.Q4_1 | 368 |
Fook-Yi-34B-32K-v1.Q5_0 | 387 |
Fook-Yi-34B-32K-v1.i1-Q5_0 | 388 |
Fook-Yi-34B-32K-v1.i1-Q5_K_S | 388 |
Fook-Yi-34B-32K-v1.Q5_K_S | 394 |
Fook-Yi-34B-32K-v1.i1-Q5_K_M | 397 |
Fook-Yi-34B-32K-v1.Q5_K_M | 398 |
Fook-Yi-34B-32K-v1.Q5_1 | 416 |
Fook-Yi-34B-32K-v1.i1-Q5_1 | 416 |
Fook-Yi-34B-32K-v1.i1-Q6_K | 455 |
Fook-Yi-34B-32K-v1.Q6_K | 456 |
Fook-Yi-34B-32K-v1.Q8_0 | 547 |
Fook-Yi-34B-32K-v1.SOURCE | 975 |
ARC easy performance
Filename | Time (seconds) |
---|---|
Fook-Yi-34B-32K-v1.i1-IQ1_S | 106 |
Fook-Yi-34B-32K-v1.i1-IQ1_M | 117 |
Fook-Yi-34B-32K-v1.i1-IQ2_XXS | 132 |
Fook-Yi-34B-32K-v1.i1-IQ2_XS | 139 |
Fook-Yi-34B-32K-v1.i1-IQ2_S | 148 |
Fook-Yi-34B-32K-v1.i1-IQ2_M | 159 |
Fook-Yi-34B-32K-v1.Q2_K | 165 |
Fook-Yi-34B-32K-v1.i1-Q2_K_S | 170 |
Fook-Yi-34B-32K-v1.i1-Q2_K | 172 |
Fook-Yi-34B-32K-v1.i1-IQ3_XXS | 176 |
Fook-Yi-34B-32K-v1.IQ3_XS | 185 |
Fook-Yi-34B-32K-v1.i1-IQ3_XS | 189 |
Fook-Yi-34B-32K-v1.IQ3_S | 190 |
Fook-Yi-34B-32K-v1.Q3_K_S | 190 |
Fook-Yi-34B-32K-v1.i1-IQ3_S | 198 |
Fook-Yi-34B-32K-v1.i1-Q3_K_S | 203 |
Fook-Yi-34B-32K-v1.IQ3_M | 206 |
Fook-Yi-34B-32K-v1.Q3_K_M | 206 |
Fook-Yi-34B-32K-v1.i1-IQ3_M | 207 |
Fook-Yi-34B-32K-v1.i1-Q3_K_M | 225 |
Fook-Yi-34B-32K-v1.IQ4_XS | 229 |
Fook-Yi-34B-32K-v1.Q3_K_L | 230 |
Fook-Yi-34B-32K-v1.IQ4_NL | 242 |
Fook-Yi-34B-32K-v1.i1-Q3_K_L | 242 |
Fook-Yi-34B-32K-v1.Q4_K_S | 243 |
Fook-Yi-34B-32K-v1.i1-IQ4_XS | 243 |
Fook-Yi-34B-32K-v1.Q4_0 | 245 |
Fook-Yi-34B-32K-v1.i1-IQ4_NL | 256 |
Fook-Yi-34B-32K-v1.SOURCE | 257 |
Fook-Yi-34B-32K-v1.i1-Q4_K_S | 262 |
Fook-Yi-34B-32K-v1.i1-Q4_0 | 270 |
Fook-Yi-34B-32K-v1.Q4_1 | 274 |
Fook-Yi-34B-32K-v1.Q4_K_M | 275 |
Fook-Yi-34B-32K-v1.i1-Q4_K_M | 276 |
Fook-Yi-34B-32K-v1.Q5_0 | 287 |
Fook-Yi-34B-32K-v1.Q5_K_S | 295 |
Fook-Yi-34B-32K-v1.Q5_K_M | 301 |
Fook-Yi-34B-32K-v1.i1-Q4_1 | 301 |
Fook-Yi-34B-32K-v1.Q5_1 | 320 |
Fook-Yi-34B-32K-v1.i1-Q5_K_S | 323 |
Fook-Yi-34B-32K-v1.i1-Q5_0 | 332 |
Fook-Yi-34B-32K-v1.Q6_K | 347 |
Fook-Yi-34B-32K-v1.i1-Q5_K_M | 352 |
Fook-Yi-34B-32K-v1.i1-Q5_1 | 356 |
Fook-Yi-34B-32K-v1.i1-Q6_K | 386 |
Fook-Yi-34B-32K-v1.Q8_0 | 458 |
ARC challenge performance
Filename | Time (seconds) |
---|---|
Fook-Yi-34B-32K-v1.i1-IQ1_S | 23 |
Fook-Yi-34B-32K-v1.i1-IQ2_XXS | 25 |
Fook-Yi-34B-32K-v1.i1-IQ1_M | 26 |
Fook-Yi-34B-32K-v1.i1-IQ2_S | 27 |
Fook-Yi-34B-32K-v1.i1-IQ2_XS | 27 |
Fook-Yi-34B-32K-v1.i1-IQ2_M | 30 |
Fook-Yi-34B-32K-v1.i1-IQ3_XS | 31 |
Fook-Yi-34B-32K-v1.i1-IQ3_XXS | 31 |
Fook-Yi-34B-32K-v1.Q2_K | 32 |
Fook-Yi-34B-32K-v1.i1-Q2_K | 32 |
Fook-Yi-34B-32K-v1.i1-Q2_K_S | 32 |
Fook-Yi-34B-32K-v1.IQ3_XS | 33 |
Fook-Yi-34B-32K-v1.Q3_K_S | 33 |
Fook-Yi-34B-32K-v1.i1-IQ3_M | 33 |
Fook-Yi-34B-32K-v1.i1-IQ3_S | 33 |
Fook-Yi-34B-32K-v1.i1-Q3_K_S | 33 |
Fook-Yi-34B-32K-v1.IQ3_M | 34 |
Fook-Yi-34B-32K-v1.IQ3_S | 34 |
Fook-Yi-34B-32K-v1.i1-Q3_K_M | 35 |
Fook-Yi-34B-32K-v1.Q3_K_M | 36 |
Fook-Yi-34B-32K-v1.IQ4_XS | 37 |
Fook-Yi-34B-32K-v1.Q3_K_L | 38 |
Fook-Yi-34B-32K-v1.Q4_0 | 38 |
Fook-Yi-34B-32K-v1.i1-IQ4_XS | 38 |
Fook-Yi-34B-32K-v1.i1-Q3_K_L | 38 |
Fook-Yi-34B-32K-v1.i1-IQ4_NL | 39 |
Fook-Yi-34B-32K-v1.i1-Q4_0 | 39 |
Fook-Yi-34B-32K-v1.i1-Q4_K_S | 39 |
Fook-Yi-34B-32K-v1.IQ4_NL | 40 |
Fook-Yi-34B-32K-v1.Q4_K_S | 40 |
Fook-Yi-34B-32K-v1.Q4_K_M | 41 |
Fook-Yi-34B-32K-v1.i1-Q4_K_M | 41 |
Fook-Yi-34B-32K-v1.Q4_1 | 42 |
Fook-Yi-34B-32K-v1.i1-Q4_1 | 42 |
Fook-Yi-34B-32K-v1.i1-Q5_0 | 44 |
Fook-Yi-34B-32K-v1.Q5_0 | 45 |
Fook-Yi-34B-32K-v1.Q5_K_M | 45 |
Fook-Yi-34B-32K-v1.Q5_K_S | 45 |
Fook-Yi-34B-32K-v1.i1-Q5_K_S | 45 |
Fook-Yi-34B-32K-v1.i1-Q5_K_M | 46 |
Fook-Yi-34B-32K-v1.Q5_1 | 48 |
Fook-Yi-34B-32K-v1.i1-Q5_1 | 48 |
Fook-Yi-34B-32K-v1.i1-Q6_K | 52 |
Fook-Yi-34B-32K-v1.Q6_K | 53 |
Fook-Yi-34B-32K-v1.Q8_0 | 63 |
Fook-Yi-34B-32K-v1.SOURCE | 110 |
Winogrande performance
Filename | Time (seconds) |
---|---|
Fook-Yi-34B-32K-v1.i1-IQ1_S | 45 |
Fook-Yi-34B-32K-v1.i1-IQ2_XXS | 50 |
Fook-Yi-34B-32K-v1.i1-IQ1_M | 54 |
Fook-Yi-34B-32K-v1.i1-IQ2_XS | 56 |
Fook-Yi-34B-32K-v1.i1-IQ2_S | 57 |
Fook-Yi-34B-32K-v1.i1-IQ2_M | 60 |
Fook-Yi-34B-32K-v1.i1-IQ3_XXS | 63 |
Fook-Yi-34B-32K-v1.i1-Q2_K_S | 64 |
Fook-Yi-34B-32K-v1.i1-IQ3_XS | 65 |
Fook-Yi-34B-32K-v1.i1-Q2_K | 65 |
Fook-Yi-34B-32K-v1.Q2_K | 66 |
Fook-Yi-34B-32K-v1.i1-IQ3_S | 68 |
Fook-Yi-34B-32K-v1.i1-Q3_K_S | 68 |
Fook-Yi-34B-32K-v1.IQ3_S | 69 |
Fook-Yi-34B-32K-v1.i1-IQ3_M | 69 |
Fook-Yi-34B-32K-v1.IQ3_M | 70 |
Fook-Yi-34B-32K-v1.IQ3_XS | 70 |
Fook-Yi-34B-32K-v1.Q3_K_S | 70 |
Fook-Yi-34B-32K-v1.i1-Q3_K_M | 73 |
Fook-Yi-34B-32K-v1.Q3_K_M | 75 |
Fook-Yi-34B-32K-v1.i1-IQ4_XS | 77 |
Fook-Yi-34B-32K-v1.IQ4_XS | 78 |
Fook-Yi-34B-32K-v1.Q3_K_L | 78 |
Fook-Yi-34B-32K-v1.i1-Q3_K_L | 78 |
Fook-Yi-34B-32K-v1.Q4_0 | 79 |
Fook-Yi-34B-32K-v1.i1-Q4_0 | 80 |
Fook-Yi-34B-32K-v1.i1-Q4_K_S | 80 |
Fook-Yi-34B-32K-v1.IQ4_NL | 81 |
Fook-Yi-34B-32K-v1.i1-IQ4_NL | 81 |
Fook-Yi-34B-32K-v1.Q4_K_S | 82 |
Fook-Yi-34B-32K-v1.Q4_K_M | 83 |
Fook-Yi-34B-32K-v1.i1-Q4_K_M | 84 |
Fook-Yi-34B-32K-v1.Q4_1 | 87 |
Fook-Yi-34B-32K-v1.i1-Q4_1 | 87 |
Fook-Yi-34B-32K-v1.Q5_0 | 93 |
Fook-Yi-34B-32K-v1.i1-Q5_0 | 93 |
Fook-Yi-34B-32K-v1.i1-Q5_K_S | 93 |
Fook-Yi-34B-32K-v1.Q5_K_M | 95 |
Fook-Yi-34B-32K-v1.Q5_K_S | 95 |
Fook-Yi-34B-32K-v1.i1-Q5_K_M | 95 |
Fook-Yi-34B-32K-v1.Q5_1 | 100 |
Fook-Yi-34B-32K-v1.i1-Q5_1 | 100 |
Fook-Yi-34B-32K-v1.i1-Q6_K | 109 |
Fook-Yi-34B-32K-v1.Q6_K | 111 |
Fook-Yi-34B-32K-v1.Q8_0 | 130 |
Fook-Yi-34B-32K-v1.SOURCE | 227 |
Ah, the claim was that Q4_1 uses less energy than Q4_K_M ("for the quality"), which is a lot of variables. That claim, however, is also not backed up by your benchmarks (assuming longer time means more energy usage), unless somehow "for the quality" figures in in favour of Q4_1, which, again, seems not to be the case.
@yttria, it seems Q4_1 is bigger, slower and worse than Q4_K_M, and not that much better than Q4_0, which is even faster.
PS: I didn't try to make you make these benchmarks, but I take them :-) However, they do seem a bit fishy - they follow more or less exactly the quant size, indicating a memory bottleneck, so cpu speed doesn't even figure in. I would assume yttria did it on a cpu with a lot fewer cores, where things can be dramatically different. Which is why I provide Q4_0 in the first place, as a fast quant for cpus. I hope one of the results of all this benchmarking (quality and speed) is to tell us once and for all if Q4_0 actually is useful (it might well be that another quant gives a better quality/time ratio).
Great point. I'm aware that the shared performance test results are not perfect, which is not surprising as I never intended those tests to be used to measure quant performance. I did all tests on a Threadripper PRO 7975WX (32 cores, 64 threads) while also using the GPU for computation. This resulted in super fast prompt evaluation during prompt-evaluation-heavy tests. So those tests are indeed almost certainly memory bottlenecked, which might not always be the case during normal use depending on the hardware. I'm wondering if there are really any realistic cases where LLMs are not memory bottlenecked. After just a few threads I start seeing diminishing returns when adding more. Usual consumer hardware has just dual-channel memory and so should hit the bottleneck even sooner. I definitely want to do more careful testing regarding this and maybe even test on more mainstream hardware like my laptop. I actually already did many GGUF performance tests one year ago during the planning stage of my StormPeak build. I did so by changing the memory channels, memory clock speed and number of threads assigned to llama.cpp. The conclusion back then was pretty much that the more memory channels and the faster the memory, the better the performance of GGUF files executed on the CPU. This was the main reason I decided to go for the much more expensive Threadripper PRO with octa-channel memory instead of the normal Threadripper lineup with quad-channel memory for my StormPeak node.
I agree with your methodology; these side effects were of course not the goal. There are lots of realistic cases where the cpu is the bottleneck, though. In fact, most cpus will be the bottleneck when confronted with IQ quants, and memory bandwidth will be the bottleneck with normal Q quants. Which is totally in line with your experience (no IQ quants last year). There are even lots of systems where the cpu might be the bottleneck even for Q4_K_M, which is why I added Q4_0 quants back - purely for speed.
Whenever I get to my performance benchmarks, my plan is to make a 4-core baseline test for specifically slower cpus. There are a lot of people who run smaller models on not very beefy laptops, for example.
This is my test on M3 Max processing and generating a fixed number of tokens with a 7B model:
Prompt processing
Quant | Time / s | Energy / J |
---|---|---|
F16 | 5.1 | 254 |
Q8_0 | 5.3 | 270 |
Q6_K | 6.2 | 320 |
Q5_1 | 5.8 | 290 |
Q5_K | 6.3 | 352 |
Q5_0 | 5.8 | 290 |
Q5_K_S | 6.3 | 322 |
Q4_1 | 5.3 | 257 |
Q4_K | 5.7 | 290 |
IQ4_NL | 5.5 | 276 |
Q4_K_S | 5.6 | 285 |
Q4_0 | 5.3 | 259 |
IQ4_XS | 5.5 | 283 |
Token generation
Quant | Time / s | Energy / J |
---|---|---|
F16 | 21.2 | 476 |
Q8_0 | 12.2 | 323 |
Q6_K | 10.0 | 412 |
Q5_1 | 10.3 | 416 |
Q5_K | 10.4 | 477 |
Q5_0 | 9.6 | 391 |
Q5_K_S | 10.4 | 489 |
Q4_1 | 8.2 | 241 |
Q4_K | 8.3 | 312 |
IQ4_NL | 8.4 | 345 |
Q4_K_S | 8.1 | 304 |
Q4_0 | 7.6 | 227 |
IQ4_XS | 7.7 | 307 |
well, Q4_0 seems to be more energy-efficient than Q4_1, and is probably higher quality per bit, too
Another thing: I see you are providing f16 ggufs for some bf16 models. Wouldn't it be better to convert directly to bf16 gguf to eliminate conversion loss?
Since f16 has higher precision and weights should be mostly normalised, there shouldn't be any conversion loss unless the model already has issues (and those issues would translate to other quants as well). The purpose of the f16 quant is not to provide a faithful representation of the source (which would be provided as a SOURCE gguf) but to provide an actual f16 quant, i.e. pretty much the same purpose as Q4_0: to provide a quant for certain targets.
f16 is for sure more useful than bf16 if you consider that you need at least an Nvidia Ampere based GPU for bf16 support, while f16 has been supported since the Tegra X1 (Maxwell+) and so works on Pascal and Turing as well, not just Ampere and Ada. This is also why back at university I used a Nintendo Switch console running Linux for scientific computation, as my GTX 980 Ti Maxwell GPU did not support half precision. Ironically, on the CPU side it is exactly the opposite picture: while bf16 is already widely supported, f16 is not.
There are some rare models where you can find f16, bf16 and SOURCE quants, like this one: https://huggingface.co./mradermacher/Fook-Yi-34B-32K-v1-GGUF. SOURCE quants are usually only provided if obtaining the original model is not easily possible - for example, if a model got deleted by its author after mradermacher had already downloaded it and he notices before deleting his copy.
This is my test on M3 Max processing and generating a fixed number of tokens
Thanks a lot for sharing your measurements. What application did you use to create them? Probably a macOS thing as there seems no simple way for me to measure power consumption used for generating tokens unless I do everything on the GPU.
@nicoboss your cpu should have a power meter, try /sys/class/powercap/intel-rapl/*/energy_uj - might even have one per core. don't know how good the amd implementation is but the intel estimate is usually pretty good, and amd usually does better.
What application did you use to create them?
Energy is measured with the built-in powermetrics utility on macOS.
your cpu should have a power meter, try /sys/class/powercap/intel-rapl/*/energy_uj
That is really cool, I wasn't aware of it. AMD's intel-rapl implementation is decent. There is also amd_energy, which is so accurate it got kicked out of the linux kernel due to security concerns regarding side-channel attacks, but I can just build and load it as a kernel module.
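For reference, a minimal sketch of the RAPL approach described above (assuming read access to the sysfs counters, which often requires root, and ignoring counter wrap-around); the workload command and model path are placeholders:

```python
# Sample the intel-rapl energy counters before and after a workload and report the
# CPU energy in joules. The glob matches the top-level RAPL zones (typically CPU packages).
import glob
import subprocess

def read_energy_uj() -> int:
    total = 0
    for path in glob.glob("/sys/class/powercap/intel-rapl/*/energy_uj"):
        with open(path) as f:
            total += int(f.read())
    return total

before = read_energy_uj()
# placeholder workload -- replace with the actual llama-bench / llama-cli invocation
subprocess.run(["./llama-bench", "-m", "model.Q4_0.gguf", "-p", "0", "-n", "128"], check=False)
after = read_energy_uj()
print(f"CPU energy: {(after - before) / 1e6:.1f} J")
```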
After more testing with 70B, I think Q4_1 is really worth it for Mac. Here is a comparison of the most efficient quants on Mac. PP and TG energy usage are normalized to q4_0 on high-power mode for a fixed number of tokens. As you can see, q4_1 gives almost all of the perplexity reduction of q4_k with almost none of the performance impact.
High-power mode
Model | PP (J/T) | TG (J/T) | Perplexity |
---|---|---|---|
q4_0 | 100% | 100% | 2.9112 |
q4_1 | 104% | 105% | 2.7753 |
q4_k | 124% | 133% | 2.7565 |
Low-power mode
Model | PP (J/T) | TG (J/T) | Perplexity |
---|---|---|---|
q4_0 | 61% | 35% | 2.9112 |
q4_1 | 61% | 37% | 2.7753 |
q4_k | 70% | 63% | 2.7565 |
I'm confused @yttria , doesn't that last chart directly contradict your conclusion? Q4_K has better perplexity and better speed per Joule than q4_0 and q4_1
Sorry for the confusion. The numbers are all in joules per token, normalized to q4_0 on high-power mode. Less joules per token is better.
(just saw your update)
And you're using Metal for this? I'm very surprised.
Also what are you using for your perplexity that's getting such low results? They confuse me since typically the change from Q4_0 -> Q4_1 is the same as Q4_1 to Q4_K, so I would expect to see 2.5 for Q4_K
@yttria In the past month I spent weeks performing performance measurements of all quants on many different hardware configurations and llama.cpp backends. You can find my raw results under https://www.nicobosshard.ch/perfData.zip and quality measurements under http://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst
Even on ARM (but not Apple silicon) I saw much better performance using Q4_0 or Q4_K_S and even Q4_K_M compared to Q4_1.
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 4 | pp128 | 3.82 ± 0.00 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 4 | tg128 | 1.91 ± 0.00 |
| qwen2 3B Q4_K - Small | 1.70 GiB | 3.09 B | CPU | 4 | pp128 | 4.04 ± 0.00 |
| qwen2 3B Q4_K - Small | 1.70 GiB | 3.09 B | CPU | 4 | tg128 | 1.88 ± 0.00 |
| qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CPU | 4 | pp128 | 3.89 ± 0.00 |
| qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CPU | 4 | tg128 | 1.82 ± 0.00 |
| qwen2 3B Q4_1 | 1.85 GiB | 3.09 B | CPU | 4 | pp128 | 3.73 ± 0.00 |
| qwen2 3B Q4_1 | 1.85 GiB | 3.09 B | CPU | 4 | tg128 | 1.78 ± 0.00 |
Not only that, but according to the month I spent doing quality measurements (whose final quality numbers are listed under https://hf.tst.eu/model#Qwen2.5-3B-i1-GGUF), you can see that Q4_1 has by far the worst quality of all Q4 quants despite being the largest Q4 quant, unless you use imatrix/weighted quants:
Quant | Quality | Size |
---|---|---|
Q4_1 | 80 | 2.0 GB |
Q4_0 | 81 | 1.8 GB |
Q4_0_4_4 | 81 | 1.8 GB |
Q4_0_4_8 | 81 | 1.8 GB |
Q4_0_8_8 | 81 | 1.8 GB |
Q4_K_S | 81 | 1.8 GB |
IQ4_NL | 82 | 1.8 GB |
IQ4_XS | 82 | 1.8 GB |
Q4_K_M | 85 | 1.9 GB |
And you're using Metal for this? I'm very surprised.
Yes, this is with Metal. It is calculated by dividing Watts from powermetrics (numerator) by T/s from llama-bench (denominator). W/(T/s) simplifies to Joules per token, which is then normalized. Joules per token is a very important metric for MacBooks because of inherent thermal constraints.
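To make the arithmetic concrete, here is a tiny sketch with made-up numbers (not measurements from this thread):

```python
# joules per token = average watts (powermetrics) / tokens per second (llama-bench),
# then normalized to Q4_0.
measurements = {  # quant: (average watts, tokens per second) -- placeholder values
    "Q4_0": (28.0, 35.0),
    "Q4_1": (29.0, 34.0),
    "Q4_K": (34.0, 32.0),
}

baseline = measurements["Q4_0"][0] / measurements["Q4_0"][1]
for quant, (watts, tps) in measurements.items():
    jpt = watts / tps  # W / (T/s) = J/T
    print(f"{quant}: {jpt:.2f} J/T ({100 * jpt / baseline:.0f}% of Q4_0)")
```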
Why are you surprised by these results? My interpretation is that K-quants require more complex calculations, and therefore more energy.
Please see here for a comparison of all quants q4 and above with a 7B model:
https://huggingface.co./mradermacher/model_requests/discussions/299#66ef7cdc3b250e9eca0a5e32
Also what are you using for your perplexity that's getting such low results? They confuse me since typically the change from Q4_0 -> Q4_1 is the same as Q4_1 to Q4_K, so I would expect to see 2.5 for Q4_K
The perplexity is measured with wiki.test.raw on Meta-Llama-3.1-70B-Instruct quantized with an imatrix made with calibration_datav3.txt. I did not complete the full tests because it takes a long time for 70B models, so I only averaged the results obtained from the first parts of the tests. Please let me know if it would be useful to perform the tests completely.
Here is the full data from http://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst as a table. I averaged together the measurements of the base and instruct models to get more accurate results that better represent the values of derivative models.
Legend
Quant: Used quant
Rank: Ranking position of the quality of a specific quant compared to other quants, based on the weighted average of the per-metric ranks: 25% KL-divergence, 40% Correct token, 20% Same token and 15% Perplexity
KL-divergence: 100 - Mean KLD * 100
Correct token: Mean Δp + 100
Same token: Same top p
Perplexity: 100 + (100 - (Mean PPL(Q)/PPL(base)) * 100)
Eval: Average of ARC Easy(Q)/ARC Easy(base), ARC Challenge(Q)/ARC Challenge(base), MMLU(Q)/MMLU(base) and WinoGrande(Q)/WinoGrande(base), weighted by number of questions
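To make the legend concrete, here is a small sketch of how these scores and the rank could be computed from raw measurements; the input numbers below are hypothetical:

```python
raw = {  # quant: (mean KLD, mean delta-p in percentage points, same-top-p in %, PPL)
    "Q8_0":   (0.0025, -0.05, 96.9, 6.02),
    "Q4_K_M": (0.0385, -0.25, 89.4, 6.21),
    "Q4_0":   (0.1353, -1.79, 81.1, 6.83),
}
ppl_base = 6.00  # perplexity of the unquantized base model (hypothetical)

def scores(mean_kld, mean_dp, same_top_p, ppl):
    return {
        "KL-divergence": 100 - mean_kld * 100,
        "Correct token": mean_dp + 100,
        "Same token":    same_top_p,
        "Perplexity":    100 + (100 - (ppl / ppl_base) * 100),
    }

weights = {"KL-divergence": 0.25, "Correct token": 0.40, "Same token": 0.20, "Perplexity": 0.15}
table = {quant: scores(*vals) for quant, vals in raw.items()}

def weighted_rank(quant):
    # weighted average of the quant's per-metric rank (1 = best), as described in the legend
    total = 0.0
    for metric, w in weights.items():
        ordered = sorted(table, key=lambda q: table[q][metric], reverse=True)
        total += w * (ordered.index(quant) + 1)
    return total

for quant in sorted(table, key=weighted_rank):
    print(quant, {m: round(v, 2) for m, v in table[quant].items()})
```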
Qwen2.5-0.5B & Qwen2.5-0.5B-Instruct
Quant | Rank | KL-divergence | Correct token | Same token | Perplexity | Eval |
---|---|---|---|---|---|---|
f16 | 1 | 99.98 | 100.00 | 98.95 | 99.88 | 100.00 |
i1-Q6_K | 2 | 99.65 | 99.96 | 96.38 | 99.62 | 100.17 |
Q8_0 | 3 | 99.75 | 99.95 | 96.90 | 99.60 | 99.95 |
Q6_K | 4 | 99.57 | 99.94 | 96.06 | 99.49 | 100.35 |
i1-Q5_K_M | 5 | 98.43 | 99.76 | 92.90 | 98.31 | 100.47 |
i1-Q5_0 | 6 | 97.30 | 99.78 | 91.02 | 97.92 | 98.99 |
i1-Q4_K_M | 7 | 97.49 | 99.74 | 91.17 | 97.96 | 99.45 |
i1-Q5_1 | 8 | 98.13 | 99.68 | 92.35 | 98.19 | 100.64 |
i1-Q5_K_S | 9 | 98.09 | 99.69 | 92.30 | 98.05 | 100.32 |
Q5_0 | 10 | 96.53 | 99.73 | 89.88 | 97.18 | 99.42 |
Q5_K_M | 11 | 97.46 | 99.57 | 91.15 | 97.06 | 98.93 |
Q4_K_M | 12 | 96.15 | 99.75 | 89.35 | 96.57 | 98.11 |
i1-Q4_K_S | 13 | 96.69 | 99.63 | 90.16 | 97.09 | 98.26 |
Q5_K_S | 14 | 96.86 | 99.47 | 90.17 | 96.52 | 99.49 |
Q5_1 | 15 | 96.62 | 99.48 | 89.68 | 96.26 | 99.57 |
Q4_K_S | 16 | 94.72 | 99.58 | 87.58 | 95.11 | 99.08 |
i1-Q3_K_L | 17 | 95.79 | 99.35 | 88.81 | 96.12 | 99.70 |
i1-Q3_K_M | 18 | 94.70 | 99.33 | 87.57 | 95.17 | 97.99 |
i1-IQ4_NL | 19 | 94.28 | 99.28 | 87.21 | 95.15 | 99.75 |
i1-IQ4_XS | 20 | 94.24 | 99.26 | 87.18 | 95.09 | 100.12 |
Q3_K_L | 21 | 94.01 | 99.24 | 86.76 | 94.48 | 99.13 |
i1-Q4_1 | 22 | 94.22 | 98.97 | 87.27 | 93.60 | 98.96 |
Q3_K_M | 23 | 91.86 | 99.08 | 84.82 | 92.21 | 98.70 |
IQ4_NL | 24 | 91.91 | 99.03 | 84.61 | 92.66 | 100.11 |
IQ4_XS | 25 | 91.84 | 99.01 | 84.55 | 92.53 | 100.81 |
Q4_1 | 26 | 88.12 | 98.45 | 81.91 | 88.54 | 99.70 |
i1-Q4_0 | 27 | 89.19 | 98.42 | 82.93 | 88.41 | 101.03 |
i1-IQ3_M | 28 | 91.57 | 98.19 | 84.81 | 91.68 | 99.41 |
i1-IQ3_S | 29 | 90.68 | 98.11 | 84.19 | 91.14 | 99.44 |
i1-IQ3_XS | 30 | 90.68 | 98.11 | 84.19 | 91.14 | 99.44 |
i1-Q4_0_4_4 | 31 | 86.47 | 98.21 | 81.05 | 86.24 | 100.20 |
i1-Q4_0_8_8 | 32 | 86.47 | 98.21 | 81.06 | 86.24 | 100.12 |
Q4_0_4_4 | 33 | 86.47 | 98.21 | 81.05 | 86.24 | 100.20 |
Q4_0 | 34 | 86.47 | 98.21 | 81.05 | 86.23 | 100.09 |
i1-Q4_0_4_8 | 35 | 86.46 | 98.21 | 81.04 | 86.26 | 100.68 |
i1-IQ3_XXS | 36 | 87.41 | 97.89 | 81.90 | 87.74 | 99.24 |
i1-Q2_K | 37 | 84.69 | 98.20 | 79.72 | 84.62 | 98.45 |
i1-Q3_K_S | 38 | 84.38 | 98.20 | 79.50 | 83.64 | 98.12 |
i1-Q2_K_S | 39 | 79.46 | 97.26 | 77.37 | 79.16 | 96.01 |
i1-IQ2_M | 40 | 79.57 | 97.01 | 77.30 | 78.79 | 97.07 |
Q2_K | 41 | 78.09 | 97.09 | 75.60 | 77.24 | 96.27 |
Q3_K_S | 42 | 76.07 | 96.70 | 74.54 | 73.89 | 96.67 |
i1-IQ2_S | 43 | 74.86 | 96.35 | 75.16 | 72.95 | 94.92 |
i1-IQ2_XS | 44 | 71.95 | 95.88 | 74.07 | 69.24 | 95.13 |
IQ3_M | 45 | 72.45 | 94.85 | 72.61 | 71.49 | 94.11 |
IQ3_S | 46 | 66.83 | 93.93 | 70.41 | 64.94 | 94.16 |
i1-IQ2_XXS | 47 | 64.03 | 94.76 | 70.96 | 58.20 | 93.34 |
IQ3_XS | 48 | 66.83 | 93.93 | 70.41 | 64.94 | 94.16 |
i1-IQ1_M | 49 | 46.20 | 92.18 | 65.53 | 32.22 | 92.40 |
i1-IQ1_S | 50 | 33.15 | 90.26 | 60.99 | 10.25 | 92.66 |
Qwen2.5-1.5B & Qwen2.5-1.5B-Instruct
Quant | Rank | KL-divergence | Correct token | Same token | Perplexity | Eval |
---|---|---|---|---|---|---|
f16 | 1 | 99.96 | 100.00 | 98.63 | 99.85 | 99.49 |
Q8_0 | 2 | 99.75 | 99.95 | 96.80 | 99.70 | 99.35 |
i1-Q6_K | 3 | 99.35 | 99.94 | 95.15 | 99.36 | 99.63 |
Q6_K | 4 | 99.23 | 99.90 | 94.76 | 99.20 | 100.27 |
i1-Q5_K_M | 5 | 98.61 | 99.84 | 93.43 | 98.89 | 100.10 |
i1-Q5_K_S | 6 | 98.40 | 99.81 | 93.07 | 98.81 | 99.74 |
i1-Q5_1 | 7 | 98.50 | 99.76 | 93.17 | 98.72 | 99.97 |
i1-Q5_0 | 8 | 98.08 | 99.82 | 92.48 | 98.46 | 100.34 |
Q5_K_M | 9 | 98.13 | 99.73 | 92.53 | 98.45 | 99.97 |
Q5_0 | 10 | 97.48 | 99.83 | 91.56 | 97.93 | 100.28 |
Q5_K_S | 11 | 97.81 | 99.69 | 92.01 | 98.19 | 100.10 |
Q5_1 | 12 | 97.65 | 99.63 | 91.76 | 98.63 | 100.74 |
i1-Q4_K_M | 13 | 96.61 | 99.51 | 90.54 | 96.85 | 98.50 |
i1-IQ4_NL | 14 | 95.71 | 99.40 | 89.33 | 96.33 | 98.02 |
i1-Q4_1 | 15 | 96.02 | 99.37 | 89.79 | 96.35 | 99.18 |
i1-Q4_K_S | 16 | 96.02 | 99.39 | 89.68 | 96.30 | 98.95 |
i1-IQ4_XS | 17 | 95.61 | 99.39 | 89.29 | 96.21 | 97.74 |
Q4_K_M | 18 | 94.67 | 99.29 | 88.33 | 94.89 | 99.05 |
IQ4_NL | 19 | 94.20 | 99.07 | 87.83 | 94.45 | 97.55 |
IQ4_XS | 20 | 94.12 | 98.98 | 87.69 | 94.35 | 97.79 |
Q4_K_S | 21 | 93.64 | 99.00 | 87.15 | 93.91 | 98.39 |
i1-Q4_0 | 22 | 93.27 | 98.85 | 86.85 | 93.62 | 95.66 |
Q4_1 | 23 | 92.16 | 98.81 | 85.93 | 93.62 | 99.01 |
i1-Q4_0_8_8 | 24 | 90.63 | 98.30 | 84.55 | 90.89 | 95.02 |
Q4_0_8_8 | 25 | 90.63 | 98.30 | 84.55 | 90.89 | 95.02 |
i1-Q4_0_4_4 | 26 | 90.62 | 98.30 | 84.57 | 90.88 | 95.41 |
i1-Q3_K_L | 27 | 89.82 | 98.78 | 84.15 | 90.87 | 96.86 |
i1-Q4_0_4_8 | 28 | 90.62 | 98.30 | 84.54 | 90.87 | 94.63 |
Q4_0 | 29 | 90.62 | 98.30 | 84.58 | 90.86 | 94.73 |
Q4_0_4_4 | 30 | 90.62 | 98.30 | 84.57 | 90.88 | 95.41 |
i1-Q3_K_M | 31 | 88.46 | 98.43 | 83.29 | 89.37 | 96.24 |
Q4_0_4_8 | 32 | 90.62 | 98.30 | 84.54 | 90.87 | 94.63 |
Q3_K_L | 33 | 85.22 | 98.12 | 81.21 | 85.40 | 93.59 |
i1-IQ3_S | 34 | 86.96 | 97.27 | 82.64 | 87.23 | 94.36 |
i1-IQ3_M | 35 | 87.36 | 96.81 | 82.82 | 87.87 | 95.64 |
Q3_K_M | 36 | 82.13 | 97.44 | 79.33 | 81.82 | 94.54 |
i1-IQ3_XS | 37 | 83.57 | 97.17 | 80.56 | 84.34 | 94.86 |
i1-Q3_K_S | 38 | 78.63 | 96.16 | 77.80 | 78.76 | 91.48 |
i1-IQ3_XXS | 39 | 76.39 | 96.23 | 76.38 | 77.37 | 95.26 |
Q3_K_S | 40 | 69.85 | 94.87 | 73.55 | 67.40 | 92.18 |
i1-Q2_K | 41 | 64.13 | 93.99 | 72.13 | 61.91 | 89.15 |
IQ3_M | 42 | 64.11 | 92.23 | 72.34 | 61.28 | 89.26 |
IQ3_S | 43 | 55.66 | 92.38 | 69.81 | 50.07 | 89.04 |
i1-Q2_K_S | 44 | 51.30 | 92.47 | 67.78 | 43.70 | 86.60 |
i1-IQ2_M | 45 | 55.11 | 92.23 | 68.34 | 48.23 | 89.28 |
IQ3_XS | 46 | 38.63 | 90.37 | 64.25 | 23.02 | 87.43 |
i1-IQ2_S | 47 | 39.28 | 90.11 | 63.67 | 24.50 | 86.51 |
i1-IQ2_XS | 48 | 28.03 | 88.67 | 61.81 | 3.74 | 85.87 |
Q2_K | 49 | 6.27 | 86.20 | 56.81 | -45.62 | 83.48 |
i1-IQ2_XXS | 50 | -20.02 | 83.20 | 52.21 | -115.59 | 81.66 |
i1-IQ1_M | 51 | -98.45 | 74.63 | 39.42 | -499.91 | 75.59 |
i1-IQ1_S | 52 | -197.38 | 69.58 | 28.07 | -1669.73 | 69.42 |
Qwen2.5-3B & Qwen2.5-3B-Instruct
Quant | Rank | KL-divergence | Correct token | Same token | Perplexity | Eval |
---|---|---|---|---|---|---|
f16 | 1 | 99.96 | 100.00 | 98.65 | 99.79 | 99.90 |
Q8_0 | 2 | 99.77 | 99.95 | 97.10 | 99.59 | 100.06 |
i1-Q6_K | 3 | 99.40 | 99.92 | 95.44 | 99.07 | 100.24 |
Q6_K | 4 | 99.26 | 99.89 | 94.96 | 99.02 | 100.27 |
i1-Q5_K_M | 5 | 98.66 | 99.83 | 93.78 | 98.54 | 99.74 |
i1-Q5_1 | 6 | 98.60 | 99.74 | 93.69 | 98.63 | 100.12 |
i1-Q5_K_S | 7 | 98.49 | 99.77 | 93.42 | 98.55 | 99.75 |
i1-Q5_0 | 8 | 98.08 | 99.81 | 92.90 | 97.85 | 99.38 |
Q5_K_M | 9 | 98.10 | 99.69 | 92.80 | 98.36 | 100.68 |
Q5_0 | 10 | 97.48 | 99.81 | 92.05 | 97.33 | 98.76 |
Q5_K_S | 11 | 97.76 | 99.54 | 92.34 | 98.26 | 100.90 |
Q5_1 | 12 | 97.49 | 99.63 | 91.93 | 98.22 | 100.99 |
i1-Q4_K_M | 13 | 96.89 | 99.60 | 91.09 | 97.27 | 100.05 |
i1-Q4_K_S | 14 | 96.27 | 99.47 | 90.34 | 96.91 | 100.77 |
i1-Q4_1 | 15 | 96.31 | 99.44 | 90.44 | 96.88 | 100.39 |
i1-IQ4_NL | 16 | 95.99 | 99.52 | 90.09 | 96.76 | 99.21 |
i1-IQ4_XS | 17 | 95.89 | 99.44 | 89.98 | 96.54 | 98.45 |
IQ4_NL | 18 | 93.90 | 99.17 | 87.97 | 95.59 | 97.89 |
IQ4_XS | 19 | 93.89 | 99.14 | 87.95 | 95.60 | 97.15 |
Q4_K_M | 20 | 93.89 | 99.06 | 88.28 | 95.09 | 100.83 |
i1-Q4_0 | 21 | 93.11 | 98.97 | 87.38 | 94.62 | 99.02 |
Q4_K_S | 22 | 92.26 | 98.75 | 86.78 | 93.80 | 99.90 |
Q4_1 | 23 | 89.70 | 98.44 | 85.45 | 92.71 | 98.17 |
i1-IQ3_S | 24 | 87.79 | 97.80 | 83.58 | 89.67 | 100.90 |
Q4_0 | 25 | 87.28 | 97.66 | 83.92 | 89.91 | 97.50 |
i1-IQ3_M | 26 | 87.85 | 97.37 | 83.65 | 89.16 | 99.37 |
i1-IQ3_XS | 27 | 85.60 | 97.29 | 82.29 | 88.29 | 98.03 |
i1-IQ3_XXS | 28 | 79.29 | 97.01 | 78.65 | 82.18 | 96.15 |
i1-Q2_K | 29 | 68.72 | 95.84 | 74.74 | 69.14 | 90.34 |
i1-IQ2_M | 30 | 60.19 | 94.37 | 71.84 | 58.83 | 94.19 |
i1-Q2_K_S | 31 | 55.87 | 93.83 | 70.57 | 53.49 | 90.22 |
i1-Q3_K_L | 32 | 53.20 | 93.75 | 72.13 | 49.62 | 98.50 |
i1-Q3_K_M | 33 | 52.20 | 93.61 | 71.65 | 48.14 | 98.97 |
IQ3_S | 34 | 51.44 | 91.52 | 70.02 | 48.92 | 90.45 |
i1-IQ2_S | 35 | 46.68 | 92.31 | 67.76 | 39.91 | 90.39 |
i1-Q3_K_S | 36 | 42.34 | 92.29 | 68.70 | 33.15 | 93.93 |
IQ3_M | 37 | 51.85 | 90.87 | 70.22 | 48.03 | 90.00 |
Q3_K_L | 38 | 39.66 | 91.11 | 68.60 | 31.35 | 95.74 |
i1-IQ2_XS | 39 | 37.97 | 91.16 | 65.80 | 27.04 | 90.08 |
Q3_K_M | 40 | 37.12 | 90.90 | 67.68 | 26.65 | 94.59 |
Q3_K_S | 41 | 27.69 | 89.57 | 65.06 | 10.52 | 89.93 |
IQ3_XS | 42 | 28.65 | 88.53 | 63.80 | 10.41 | 87.12 |
i1-IQ2_XXS | 43 | -2.43 | 86.46 | 57.10 | -55.65 | 82.95 |
i1-IQ1_M | 44 | -60.77 | 78.59 | 46.48 | -249.64 | 78.80 |
i1-IQ1_S | 45 | -147.43 | 70.78 | 34.79 | -886.57 | 71.59 |
Q2_K | 46 | -678.42 | 57.94 | 1.00 | -213623.63 | 66.03 |
Qwen2.5-7B & Qwen2.5-7B-Instruct
Quant | Rank | KL-divergence | Correct token | Same token | Perplexity | Eval |
---|---|---|---|---|---|---|
f16 | 1 | 99.90 | 100.04 | 97.88 | 99.77 | 100.16 |
Q8_0 | 2 | 99.80 | 100.05 | 97.24 | 99.71 | 100.05 |
Q6_K | 3 | 99.50 | 100.02 | 95.93 | 99.59 | 99.95 |
i1-Q6_K | 4 | 99.58 | 100.01 | 96.16 | 99.57 | 100.35 |
i1-Q5_K_M | 5 | 99.08 | 99.98 | 94.93 | 99.19 | 100.17 |
Q5_K_M | 6 | 98.85 | 100.00 | 94.42 | 98.99 | 100.44 |
i1-Q5_K_S | 7 | 98.90 | 99.96 | 94.51 | 99.01 | 99.96 |
i1-Q5_1 | 8 | 98.96 | 99.93 | 94.67 | 99.00 | 100.10 |
Q5_K_S | 9 | 98.60 | 99.99 | 93.84 | 98.75 | 99.59 |
i1-Q5_0 | 10 | 98.71 | 99.98 | 94.26 | 98.57 | 98.82 |
Q5_1 | 11 | 98.46 | 99.96 | 93.53 | 98.48 | 100.41 |
Q5_0 | 12 | 98.39 | 99.93 | 93.48 | 98.52 | 99.49 |
i1-Q4_K_M | 13 | 97.72 | 99.85 | 92.67 | 98.31 | 99.82 |
i1-Q4_K_S | 14 | 97.19 | 99.84 | 91.96 | 97.81 | 100.00 |
Q4_K_M | 15 | 96.84 | 99.88 | 91.36 | 97.83 | 99.95 |
i1-Q4_1 | 16 | 97.24 | 99.79 | 92.06 | 97.78 | 99.83 |
i1-IQ4_NL | 17 | 97.10 | 99.79 | 91.88 | 97.84 | 99.42 |
i1-IQ4_XS | 18 | 97.02 | 99.81 | 91.74 | 97.60 | 99.72 |
Q4_K_S | 19 | 95.89 | 99.80 | 90.06 | 96.49 | 99.88 |
IQ4_NL | 20 | 96.36 | 99.50 | 90.64 | 97.36 | 99.28 |
Q4_1 | 21 | 95.27 | 99.58 | 89.61 | 96.79 | 100.32 |
IQ4_XS | 22 | 96.30 | 99.46 | 90.59 | 97.33 | 99.23 |
i1-Q4_0 | 23 | 95.66 | 99.33 | 89.99 | 97.74 | 99.11 |
i1-Q3_K_L | 24 | 93.80 | 99.56 | 88.59 | 96.06 | 101.22 |
Q4_0 | 25 | 94.68 | 99.37 | 88.94 | 96.15 | 98.58 |
i1-Q3_K_M | 26 | 92.87 | 99.48 | 87.93 | 95.44 | 101.90 |
Q3_K_L | 27 | 91.57 | 98.90 | 86.54 | 93.53 | 99.74 |
Q3_K_M | 28 | 90.18 | 98.73 | 85.58 | 92.19 | 100.02 |
i1-IQ3_S | 29 | 90.88 | 98.26 | 86.46 | 92.12 | 97.60 |
i1-IQ3_M | 30 | 90.95 | 97.78 | 86.46 | 92.39 | 96.47 |
i1-IQ3_XS | 31 | 89.44 | 97.96 | 85.42 | 91.82 | 97.90 |
i1-IQ3_XXS | 32 | 85.00 | 98.00 | 82.54 | 88.98 | 96.74 |
i1-Q3_K_S | 33 | 84.29 | 97.34 | 81.39 | 88.49 | 99.76 |
Q3_K_S | 34 | 82.35 | 96.80 | 80.12 | 86.96 | 98.88 |
i1-Q2_K | 35 | 75.58 | 96.66 | 77.14 | 80.21 | 95.72 |
IQ3_S | 36 | 71.25 | 96.86 | 75.69 | 67.17 | 95.48 |
i1-IQ2_M | 37 | 73.26 | 96.06 | 77.18 | 77.95 | 97.02 |
i1-Q2_K_S | 38 | 69.42 | 96.65 | 75.77 | 74.07 | 95.57 |
IQ3_M | 39 | 71.97 | 96.04 | 75.46 | 68.86 | 94.94 |
IQ3_XS | 40 | 67.97 | 96.25 | 74.34 | 65.55 | 93.62 |
i1-IQ2_S | 41 | 64.93 | 94.80 | 74.29 | 68.88 | 95.15 |
i1-IQ2_XS | 42 | 60.36 | 94.29 | 72.90 | 63.76 | 93.30 |
Q2_K | 43 | 57.79 | 93.10 | 71.12 | 59.86 | 93.36 |
i1-IQ2_XXS | 44 | 38.67 | 92.06 | 66.92 | 33.20 | 89.93 |
i1-IQ1_M | 45 | 3.38 | 85.17 | 57.64 | -27.34 | 84.52 |
i1-IQ1_S | 46 | -43.82 | 78.62 | 50.06 | -160.55 | 75.97 |
Qwen2.5-14B & Qwen2.5-14B-Instruct
Quant | Rank | KL-divergence | Correct token | Same token | Perplexity | Eval |
---|---|---|---|---|---|---|
Q8_0 | 1 | 99.74 | 99.99 | 97.07 | 99.75 | 99.43 |
i1-Q6_K | 2 | 99.39 | 99.98 | 96.00 | 99.41 | 99.42 |
Q6_K | 3 | 99.30 | 99.92 | 95.83 | 99.35 | 99.40 |
i1-Q5_1 | 4 | 98.44 | 99.81 | 94.52 | 98.89 | 100.39 |
i1-Q5_K_M | 5 | 98.54 | 99.79 | 94.65 | 98.83 | 100.27 |
i1-Q5_K_S | 6 | 98.32 | 99.75 | 94.32 | 98.73 | 99.73 |
i1-Q5_0 | 7 | 98.13 | 99.81 | 94.06 | 98.56 | 100.45 |
Q5_K_M | 8 | 98.26 | 99.71 | 94.13 | 98.71 | 99.27 |
Q5_K_S | 9 | 97.75 | 99.66 | 93.52 | 98.00 | 99.24 |
Q5_0 | 10 | 97.58 | 99.68 | 93.30 | 97.53 | 99.37 |
Q5_1 | 11 | 97.54 | 99.55 | 93.23 | 97.74 | 101.23 |
i1-Q4_K_M | 12 | 96.12 | 99.38 | 92.08 | 96.95 | 99.23 |
i1-Q4_1 | 13 | 95.47 | 99.26 | 91.54 | 96.36 | 99.99 |
i1-IQ4_NL | 14 | 95.33 | 99.30 | 91.34 | 96.20 | 99.64 |
i1-Q4_K_S | 15 | 95.40 | 99.26 | 91.46 | 96.43 | 100.01 |
i1-IQ4_XS | 16 | 95.25 | 99.25 | 91.30 | 96.19 | 99.81 |
Q4_K_M | 17 | 95.23 | 98.98 | 91.13 | 96.14 | 99.78 |
IQ4_NL | 18 | 94.23 | 98.79 | 89.97 | 95.35 | 100.70 |
IQ4_XS | 19 | 94.17 | 98.85 | 89.95 | 95.30 | 100.88 |
i1-Q4_0 | 20 | 93.64 | 98.89 | 89.85 | 94.51 | 99.62 |
Q4_K_S | 21 | 93.97 | 98.76 | 89.95 | 94.81 | 99.66 |
Q4_0 | 22 | 91.93 | 98.76 | 88.48 | 93.25 | 98.40 |
Q4_1 | 23 | 92.44 | 98.52 | 88.80 | 94.35 | 99.33 |
i1-Q3_K_L | 24 | 90.38 | 98.49 | 87.77 | 92.25 | 99.37 |
i1-Q3_K_M | 25 | 89.12 | 98.29 | 87.07 | 90.89 | 99.66 |
Q3_K_L | 26 | 87.85 | 97.70 | 86.11 | 89.59 | 98.19 |
Q3_K_M | 27 | 86.08 | 97.27 | 85.08 | 88.12 | 98.42 |
i1-IQ3_S | 28 | 86.55 | 97.00 | 85.89 | 88.08 | 98.00 |
i1-IQ3_M | 29 | 86.44 | 96.60 | 85.85 | 87.70 | 98.45 |
i1-Q3_K_S | 30 | 82.70 | 97.09 | 83.39 | 84.85 | 98.46 |
i1-IQ3_XS | 31 | 83.85 | 96.63 | 84.56 | 86.65 | 97.33 |
i1-IQ3_XXS | 32 | 79.47 | 96.39 | 81.99 | 82.60 | 98.32 |
Q3_K_S | 33 | 79.34 | 96.30 | 81.51 | 80.69 | 98.33 |
IQ3_M | 34 | 74.62 | 94.48 | 79.09 | 74.54 | 96.62 |
i1-Q2_K | 35 | 70.21 | 95.13 | 78.37 | 71.78 | 98.33 |
i1-IQ2_M | 36 | 67.03 | 94.27 | 76.98 | 68.35 | 93.62 |
IQ3_S | 37 | 68.37 | 93.48 | 76.90 | 70.50 | 95.50 |
i1-Q2_K_S | 38 | 63.04 | 93.87 | 76.13 | 63.36 | 96.18 |
IQ3_XS | 39 | 63.93 | 92.93 | 75.45 | 66.32 | 95.87 |
i1-IQ2_S | 40 | 58.26 | 92.85 | 73.94 | 58.79 | 93.27 |
Q2_K | 41 | 57.82 | 92.48 | 73.71 | 56.45 | 94.39 |
i1-IQ2_XS | 42 | 55.21 | 92.32 | 72.99 | 55.35 | 93.08 |
i1-IQ2_XXS | 43 | 40.57 | 90.06 | 68.77 | 34.21 | 92.07 |
i1-IQ1_M | 44 | -6.35 | 82.86 | 58.95 | -57.84 | 83.24 |
i1-IQ1_S | 45 | -48.26 | 76.71 | 52.44 | -190.96 | 77.53 |
Qwen2.5-32B & Qwen2.5-32B-Instruct
Quant | Rank | KL-divergence | Correct token | Same token | Perplexity | Eval |
---|---|---|---|---|---|---|
Q8_0 | 1 | 99.72 | 100.05 | 97.01 | 99.86 | 100.38 |
i1-Q6_K | 2 | 99.42 | 100.00 | 95.99 | 99.72 | 100.77 |
Q6_K | 3 | 99.38 | 99.98 | 95.88 | 99.60 | 100.70 |
i1-Q5_K_M | 4 | 98.76 | 99.88 | 94.81 | 99.11 | 100.29 |
i1-Q5_K_S | 5 | 98.61 | 99.89 | 94.61 | 99.04 | 100.17 |
i1-Q5_1 | 6 | 98.69 | 99.85 | 94.73 | 98.97 | 100.17 |
i1-Q5_0 | 7 | 98.45 | 99.92 | 94.39 | 98.71 | 100.33 |
Q5_K_M | 8 | 98.62 | 99.84 | 94.56 | 99.15 | 100.68 |
Q5_K_S | 9 | 98.35 | 99.79 | 94.15 | 99.02 | 100.78 |
Q5_0 | 10 | 98.12 | 99.80 | 93.86 | 98.37 | 100.25 |
Q5_1 | 11 | 98.23 | 99.76 | 93.95 | 98.78 | 100.00 |
i1-Q4_K_M | 12 | 96.76 | 99.65 | 92.52 | 97.98 | 100.33 |
i1-IQ4_NL | 13 | 96.13 | 99.68 | 91.88 | 97.67 | 100.80 |
i1-Q4_K_S | 14 | 96.23 | 99.57 | 92.06 | 97.64 | 100.81 |
i1-Q4_1 | 15 | 96.28 | 99.53 | 92.07 | 97.60 | 99.86 |
i1-IQ4_XS | 16 | 96.08 | 99.68 | 91.86 | 97.59 | 100.30 |
Q4_K_M | 17 | 96.34 | 99.45 | 92.02 | 97.39 | 99.86 |
IQ4_NL | 18 | 95.63 | 99.43 | 91.26 | 97.43 | 100.61 |
IQ4_XS | 19 | 95.54 | 99.43 | 91.10 | 97.41 | 100.23 |
Q4_K_S | 20 | 95.61 | 99.29 | 91.24 | 97.21 | 100.39 |
i1-Q4_0 | 21 | 94.95 | 99.30 | 90.73 | 96.83 | 101.10 |
Q4_1 | 22 | 94.59 | 98.99 | 90.37 | 96.68 | 100.62 |
Q4_0 | 23 | 93.95 | 99.01 | 89.78 | 96.10 | 99.66 |
i1-Q3_K_L | 24 | 92.02 | 98.94 | 88.87 | 95.02 | 101.33 |
i1-Q3_K_M | 25 | 91.04 | 98.83 | 88.25 | 94.36 | 100.47 |
Q3_K_L | 26 | 90.61 | 98.35 | 87.79 | 93.72 | 100.04 |
Q3_K_M | 27 | 89.18 | 98.02 | 86.84 | 92.70 | 99.33 |
i1-IQ3_S | 28 | 88.82 | 97.86 | 86.80 | 91.44 | 99.10 |
i1-IQ3_M | 29 | 88.79 | 97.68 | 86.80 | 91.12 | 99.26 |
i1-IQ3_XS | 30 | 86.52 | 97.64 | 85.62 | 90.24 | 100.26 |
i1-Q3_K_S | 31 | 85.81 | 97.48 | 84.60 | 89.77 | 101.30 |
Q3_K_S | 32 | 84.02 | 97.05 | 83.70 | 87.78 | 99.08 |
i1-IQ3_XXS | 33 | 82.46 | 97.23 | 83.37 | 87.10 | 100.64 |
IQ3_M | 34 | 78.93 | 96.12 | 81.08 | 81.98 | 96.38 |
IQ3_S | 35 | 76.56 | 96.17 | 80.20 | 80.53 | 97.68 |
i1-Q2_K | 36 | 74.42 | 95.98 | 79.69 | 79.69 | 99.42 |
IQ3_XS | 37 | 74.06 | 95.73 | 79.22 | 78.69 | 98.12 |
i1-IQ2_M | 38 | 71.61 | 95.50 | 78.65 | 76.56 | 99.59 |
i1-Q2_K_S | 39 | 68.24 | 95.59 | 78.06 | 72.98 | 97.17 |
Q2_K | 40 | 66.38 | 94.33 | 76.91 | 69.93 | 96.08 |
i1-IQ2_S | 41 | 63.37 | 94.26 | 75.83 | 68.29 | 98.92 |
i1-IQ2_XS | 42 | 61.16 | 93.79 | 75.23 | 65.91 | 95.93 |
i1-IQ2_XXS | 43 | 51.61 | 92.11 | 72.35 | 53.91 | 95.04 |
i1-IQ1_M | 44 | 24.26 | 87.08 | 64.29 | 10.70 | 87.40 |
i1-IQ1_S | 45 | 5.82 | 82.99 | 60.06 | -26.44 | 80.70 |
Qwen2.5-72B-Instruct
Quant | Rank | KL-divergence | Correct token | Same token | Perplexity | Eval |
---|---|---|---|---|---|---|
Q8_0 | 1 | 99.67 | 99.99 | 97.20 | 99.58 | 99.96 |
i1-Q6_K | 2 | 99.48 | 99.97 | 96.54 | 99.55 | 99.85 |
Q6_K | 3 | 99.43 | 99.96 | 96.43 | 99.51 | 99.81 |
i1-Q5_K_M | 4 | 98.73 | 99.86 | 95.28 | 99.06 | 100.10 |
i1-Q5_K_S | 5 | 98.48 | 99.80 | 94.92 | 98.88 | 99.77 |
i1-Q5_1 | 6 | 98.53 | 99.79 | 95.02 | 98.80 | 100.18 |
Q5_K_M | 7 | 98.48 | 99.79 | 94.95 | 98.76 | 99.88 |
i1-Q5_0 | 8 | 98.27 | 99.75 | 94.68 | 98.75 | 99.66 |
Q5_K_S | 9 | 97.91 | 99.76 | 94.26 | 98.58 | 99.27 |
Q5_0 | 10 | 97.91 | 99.64 | 94.21 | 98.13 | 98.75 |
Q5_1 | 11 | 97.83 | 99.65 | 94.17 | 98.24 | 99.71 |
i1-Q4_K_M | 12 | 97.14 | 99.55 | 93.62 | 97.52 | 98.18 |
i1-Q4_K_S | 13 | 96.87 | 99.53 | 93.38 | 97.35 | 98.41 |
i1-IQ4_NL | 14 | 95.95 | 99.38 | 92.49 | 97.30 | 99.92 |
Q4_K_M | 15 | 96.67 | 99.33 | 93.05 | 96.73 | 99.07 |
Q4_K_S | 16 | 96.13 | 99.33 | 92.56 | 96.48 | 98.99 |
i1-IQ4_XS | 17 | 95.90 | 99.37 | 92.39 | 97.34 | 100.54 |
i1-Q4_1 | 18 | 95.82 | 99.20 | 92.51 | 96.43 | 98.20 |
IQ4_NL | 19 | 94.98 | 99.14 | 91.60 | 96.02 | 99.88 |
IQ4_XS | 20 | 94.94 | 99.13 | 91.54 | 96.07 | 99.72 |
i1-Q4_0 | 21 | 94.38 | 99.12 | 91.19 | 96.73 | 100.21 |
Q4_0 | 22 | 93.37 | 99.02 | 90.49 | 95.54 | 100.15 |
Q4_1 | 23 | 93.00 | 98.64 | 90.33 | 94.63 | 100.65 |
i1-Q3_K_L | 24 | 91.41 | 98.71 | 89.54 | 93.31 | 99.21 |
i1-Q3_K_M | 25 | 91.18 | 98.68 | 89.34 | 93.17 | 98.38 |
i1-Q3_K_S | 26 | 89.80 | 98.43 | 88.53 | 92.31 | 99.12 |
i1-IQ3_M | 27 | 90.98 | 97.59 | 89.19 | 93.55 | 99.42 |
i1-IQ3_S | 28 | 90.83 | 97.76 | 89.08 | 93.42 | 99.38 |
Q3_K_L | 29 | 89.32 | 97.97 | 88.06 | 91.06 | 98.96 |
Q3_K_M | 30 | 89.10 | 97.93 | 87.94 | 90.73 | 99.50 |
i1-IQ3_XS | 31 | 87.65 | 97.27 | 87.44 | 91.68 | 98.25 |
Q3_K_S | 32 | 87.43 | 97.59 | 86.96 | 89.58 | 98.52 |
i1-IQ3_XXS | 33 | 85.57 | 97.23 | 86.06 | 89.89 | 99.02 |
IQ3_S | 34 | 85.03 | 96.22 | 85.84 | 88.29 | 98.72 |
IQ3_M | 35 | 85.37 | 96.07 | 85.94 | 88.02 | 98.51 |
i1-Q2_K | 36 | 75.42 | 96.07 | 82.28 | 78.42 | 97.41 |
IQ3_XS | 37 | 80.82 | 95.63 | 83.98 | 85.63 | 97.02 |
i1-IQ2_M | 38 | 77.30 | 95.19 | 82.65 | 81.31 | 99.08 |
i1-Q2_K_S | 39 | 73.62 | 95.69 | 81.62 | 76.57 | 99.21 |
i1-IQ2_S | 40 | 71.05 | 94.12 | 80.17 | 75.63 | 99.09 |
i1-IQ2_XS | 41 | 68.93 | 93.80 | 79.49 | 73.20 | 98.75 |
Q2_K | 42 | 65.26 | 93.61 | 78.30 | 64.01 | 97.44 |
i1-IQ2_XXS | 43 | 61.76 | 92.26 | 76.92 | 64.32 | 94.72 |
i1-IQ1_M | 44 | 42.36 | 89.31 | 70.34 | 38.29 | 94.10 |
i1-IQ1_S | 45 | 34.59 | 87.53 | 68.04 | 28.83 | 94.58 |
@nicoboss It is interesting to see that Apple Silicon may be special in requiring more energy for K-quants. However, this should not diminish the value provided by traditional quants, especially since Apple Silicon is a popular platform for LLMs, and their machines are a relatively easy way to get large amounts of fast memory.
I would also like to point out that Q4_1 being the largest Q4 quant is not necessarily a downside. With large amounts of unified memory available, the larger size is not usually an issue, unlike with graphics cards. The main limitation is actually energy efficiency because of thermal constraints, which is why I specifically tested for Joules per token.
Let's look at some CUDA performance numbers measured on an AMD Ryzen Threadripper 1950X with the entire model offloaded to an RTX 2070S. Even here Q4_1 performance seems quite disappointing:
model | size | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | CUDA | 999 | 16 | pp512 | 2210.24 ± 4.85 |
qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | CUDA | 999 | 16 | tg128 | 79.22 ± 0.02 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | CUDA | 999 | 16 | pp512 | 2054.47 ± 1.20 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | CUDA | 999 | 16 | tg128 | 77.24 ± 0.02 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | CUDA | 999 | 16 | pp512 | 2050.95 ± 1.59 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | CUDA | 999 | 16 | tg128 | 76.55 ± 0.02 |
Finally, let's look at an AMD laptop with a 7840s SoC, which only has an integrated GPU. Even here Q4_1 didn't really convince me. This SoC is fully power limited, so performance equals power efficiency.
Offload
model | size | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 999 | 4 | pp128 | 116.60 ± 1.40 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 999 | 4 | tg128 | 11.36 ± 0.31 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 999 | 6 | pp128 | 115.46 ± 1.71 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 999 | 6 | tg128 | 10.76 ± 0.17 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 999 | 8 | pp128 | 125.98 ± 0.56 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 999 | 8 | tg128 | 12.69 ± 0.98 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 999 | 12 | pp128 | 120.93 ± 0.60 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 999 | 12 | tg128 | 13.60 ± 0.48 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 999 | 16 | pp128 | 119.00 ± 0.45 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 999 | 16 | tg128 | 13.92 ± 0.13 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 999 | 4 | pp128 | 96.64 ± 0.31 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 999 | 4 | tg128 | 10.22 ± 0.21 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 999 | 6 | pp128 | 101.01 ± 2.10 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 999 | 6 | tg128 | 10.30 ± 0.51 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 999 | 8 | pp128 | 105.57 ± 0.52 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 999 | 8 | tg128 | 11.95 ± 0.61 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 999 | 12 | pp128 | 102.94 ± 0.52 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 999 | 12 | tg128 | 12.57 ± 0.14 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 999 | 16 | pp128 | 102.14 ± 0.48 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 999 | 16 | tg128 | 12.64 ± 0.05 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 999 | 4 | pp128 | 111.03 ± 2.31 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 999 | 4 | tg128 | 10.51 ± 0.30 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 999 | 6 | pp128 | 112.41 ± 4.07 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 999 | 6 | tg128 | 9.85 ± 0.05 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 999 | 8 | pp128 | 121.61 ± 0.14 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 999 | 8 | tg128 | 10.93 ± 0.56 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 999 | 12 | pp128 | 116.38 ± 0.32 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 999 | 12 | tg128 | 12.15 ± 0.47 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 999 | 16 | pp128 | 114.16 ± 0.62 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 999 | 16 | tg128 | 12.28 ± 0.18 |
No offload:
model | size | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 0 | 4 | pp128 | 91.64 ± 1.23 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 0 | 4 | tg128 | 6.26 ± 0.21 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 0 | 6 | pp128 | 87.01 ± 4.10 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 0 | 6 | tg128 | 8.12 ± 0.19 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 0 | 8 | pp128 | 84.29 ± 3.42 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 0 | 8 | tg128 | 9.60 ± 0.23 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 0 | 12 | pp128 | 78.57 ± 2.24 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 0 | 12 | tg128 | 9.68 ± 0.22 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 0 | 16 | pp128 | 75.69 ± 2.16 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | Vulkan,RPC | 0 | 16 | tg128 | 9.92 ± 0.11 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 0 | 4 | pp128 | 82.06 ± 3.72 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 0 | 4 | tg128 | 8.12 ± 0.52 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 0 | 6 | pp128 | 81.88 ± 1.19 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 0 | 6 | tg128 | 9.71 ± 0.08 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 0 | 8 | pp128 | 78.67 ± 0.93 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 0 | 8 | tg128 | 9.97 ± 0.01 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 0 | 12 | pp128 | 74.40 ± 0.88 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 0 | 12 | tg128 | 9.59 ± 0.08 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 0 | 16 | pp128 | 71.01 ± 1.47 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | Vulkan,RPC | 0 | 16 | tg128 | 9.60 ± 0.07 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 0 | 4 | pp128 | 88.09 ± 2.42 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 0 | 4 | tg128 | 5.15 ± 0.15 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 0 | 6 | pp128 | 87.39 ± 1.05 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 0 | 6 | tg128 | 7.00 ± 0.07 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 0 | 8 | pp128 | 82.96 ± 0.15 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 0 | 8 | tg128 | 8.08 ± 0.11 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 0 | 12 | pp128 | 77.33 ± 1.95 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 0 | 12 | tg128 | 8.77 ± 0.08 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 0 | 16 | pp128 | 74.92 ± 1.20 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | Vulkan,RPC | 0 | 16 | tg128 | 9.28 ± 0.04 |
CPU
model | size | params | backend | ngl | threads | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | RPC | 0 | 4 | pp128 | 8.60 ± 0.01 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | RPC | 0 | 4 | tg128 | 5.84 ± 0.28 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | RPC | 0 | 6 | pp128 | 12.81 ± 0.01 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | RPC | 0 | 6 | tg128 | 7.77 ± 0.11 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | RPC | 0 | 8 | pp128 | 16.99 ± 0.04 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | RPC | 0 | 8 | tg128 | 9.00 ± 0.27 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | RPC | 0 | 12 | pp128 | 24.22 ± 0.13 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | RPC | 0 | 12 | tg128 | 9.51 ± 0.05 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | RPC | 0 | 16 | pp128 | 32.43 ± 0.36 |
qwen2 7B Q4_0 | 4.12 GiB | 7.62 B | RPC | 0 | 16 | tg128 | 9.49 ± 0.01 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | RPC | 0 | 4 | pp128 | 12.19 ± 1.29 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | RPC | 0 | 4 | tg128 | 8.13 ± 0.18 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | RPC | 0 | 6 | pp128 | 15.81 ± 0.64 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | RPC | 0 | 6 | tg128 | 9.65 ± 0.06 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | RPC | 0 | 8 | pp128 | 18.52 ± 1.00 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | RPC | 0 | 8 | tg128 | 9.88 ± 0.06 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | RPC | 0 | 12 | pp128 | 25.04 ± 0.43 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | RPC | 0 | 12 | tg128 | 9.70 ± 0.07 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | RPC | 0 | 16 | pp128 | 31.89 ± 0.06 |
qwen2 7B Q4_K - Small | 4.15 GiB | 7.62 B | RPC | 0 | 16 | tg128 | 9.41 ± 0.05 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | RPC | 0 | 4 | pp128 | 7.02 ± 0.18 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | RPC | 0 | 4 | tg128 | 5.23 ± 0.08 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | RPC | 0 | 6 | pp128 | 9.35 ± 1.25 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | RPC | 0 | 6 | tg128 | 6.89 ± 0.13 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | RPC | 0 | 8 | pp128 | 10.67 ± 0.58 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | RPC | 0 | 8 | tg128 | 7.84 ± 0.08 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | RPC | 0 | 12 | pp128 | 13.67 ± 0.25 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | RPC | 0 | 12 | tg128 | 8.67 ± 0.09 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | RPC | 0 | 16 | pp128 | 16.88 ± 0.01 |
qwen2 7B Q4_1 | 4.53 GiB | 7.62 B | RPC | 0 | 16 | tg128 | 8.97 ± 0.01 |
@nicoboss Again, you are measuring on different platforms with completely different architectures. Your insistence that Q4_1 is disappointing does not disprove the value it has on Apple Silicon. GPU users have so many quants available to choose from depending on what fits in their VRAM; why should Apple Silicon users not be provided with the same choice based on what works best for their machines? Especially now that Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 are no longer necessary, it would not be a strain on resources to replace three quants with one.
I understand anti-Apple sentiment but restricting user choice is not the answer.
@yttria
Can you please run the following on Q4_0, Q4_K_S and Q4_1? I would expect the performance of Q4_1 to be better than Q4_0 and Q4_K_S on your Apple silicon if it really is able to process Q4_1 so much more efficiently. Please use the latest llama.cpp, which includes online conversion to optimize some Q4 quants for your CPU.
./llama-bench -m $modelpath -v --prio 3 -t 4,6,8,12,16 > ./perfOut.txt
Especially now that Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 are no longer necessary, it would not be a strain of resources to replace three quants with one.
@yttria I thought online repacking as implemented in https://github.com/ggerganov/llama.cpp/pull/10446 uses Q4_0 and not Q4_1 as its base. If online repacking really does use Q4_1 it would obviously instantly get added, because ARM/RISC-V optimization is a huge deal.
llama.cpp now having online repacking is also one of the reasons why I'm quite skeptical that Q4_1 is really better on apple silicon compared to an online repacked ARM optimised Q4_0.
I understand anti-Apple sentiment but restricting user choice is not the answer.
I'm not anti-Apple but before adding a new quant we need to be sure it really is significantly better on Apple silicon because it seems much worse for every other hardware. I fully support adding Q4_1 quants if they turn out to be the best choice for Apple silicon users.
I again checked the quality table on https://hf.tst.eu/model#Qwen2.5-3B-i1-GGUF: using imatrix/weighted quants, i1-Q4_1 has a quality of 87, which is equal to i1-Q4_K_S. While still worse in terms of quality/size ratio, it is a massive improvement compared to the static Q4_1 quant and so seems worth adding for imatrix quants if Q4_1 is the best choice for Apple silicon users.
I don't have time to test right now (I probably will in the next few days), but I found this person who also did tests on Apple Silicon.
https://beebopkim.github.io/2024/03/09/Benchmarks-for-lots-of-quantization-types-in-llama-cpp/
I don't quite agree with his idea of taking pp * tg / ppl, so I plotted pp * tg against ppl with his data. In the graph below, the top left corner is optimal. It is clear that Q4_K and Q4_K_S are worse than the other Q4 quants.
Note that he didn't test power consumption, and from my tests Q4_0 and Q4_1 use much less power than both K-quants and I-quants. So taking power consumption into account, Q4_0 and Q4_1 are the best quants for Mac, with Q4_1 using ~4% more power but giving better perplexity.
Now let's see if it's worth paying 4% more for Q4_1.
I found the only reasonable quants to use on Mac are Q4_0, Q4_1 and Q8_0.
According to https://hf.tst.eu, the quality of the quants is as follows. Σ(Js)/T values are from my previous test of 7B models.
Quant | Quality | Σ(Js)/T |
---|---|---|
Q4_0 | 84 | 100% |
Q4_1 | 87 | 104% |
Q8_0 | 99 | 129% |
Q4_1 has 3 more quality points than Q4_0 for only 4% more Σ(Js)/T, meaning each quality point costs about 1.3%. Q8_0 has 12 more quality points than Q4_1 but requires 25% more Σ(Js)/T, with each quality point costing about 2.1%. Therefore, it is worth paying 4% more Σ(Js)/T for Q4_1 rather than Q4_0.
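As a quick sanity check of that arithmetic, here is a small sketch using the quality and energy figures from the table above:

```python
quants = {  # quant: (quality from hf.tst.eu, energy relative to Q4_0 in %)
    "Q4_0": (84, 100),
    "Q4_1": (87, 104),
    "Q8_0": (99, 129),
}

def cost_per_quality_point(frm, to):
    (q0, e0), (q1, e1) = quants[frm], quants[to]
    return (e1 - e0) / (q1 - q0)  # extra energy (percentage points) per quality point

print(f"Q4_0 -> Q4_1: {cost_per_quality_point('Q4_0', 'Q4_1'):.1f}% per point")  # ~1.3
print(f"Q4_1 -> Q8_0: {cost_per_quality_point('Q4_1', 'Q8_0'):.1f}% per point")  # ~2.1
```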
Q4_1 has 3 more quality points than Q4_0 for only 4% more Σ(Js)/T, meaning each quality point costs about 1.3%. Q8_0 has 12 more quality points than Q4_1 but requires 25% more Σ(Js)/T, with each quality point costing about 2.1%. Therefore, it is worth paying 4% more Σ(Js)/T for Q4_1 rather than Q4_0.
Keep in mind that this only applies to imatrix/weighted quants. For static quants, Q4_1 has, at only 80, the worst quality of all Q4 quants, even getting beaten by Q4_0 at 81. In that case the 4% extra power it requires makes it objectively a worse option than static Q4_0.
Here is the full plot of all quants q4 and above. Top left is optimal. Only Q4_0, Q4_1 and Q8_0 are on the frontier.
This plot is very convincing and shows that imatrix/weighted i1-Q4_1 and i1-Q4_K_S achieve the same quality while i1-Q4_1 uses significantly less power on Apple silicon. I just hope Q4_1 is also faster than Q4_0 on Apple silicon, but it probably will be given that you mentioned the SoC is thermally limited.
@mradermacher With all the data provided so far I recommend adding i1-Q4_1 to imatrix/weighted quants but recommend against adding Q4_1 to static quants.
Wow, thanks to all of you for your data driven arguments, which are the second-best arguments around. Let's say you all convinced me:
The original reason we dropped the Q4_1 quant was its erratic performance, not unlike the IQ3 static quants. I think we have some good evidence that the worst issues with it have gone in the last two months, especially with the imatrix one. As such, I am strongly contemplating bringing back the static iq3 quants, knowing that their good qwen results might have been a fluke.
That the arm quants are gone is not an argument in favour of Q4_1 per se, but it makes the decision to re-add it easier (because I'd like to avoid outright wasting of resources, which was a strong factor against the arm quants. Good riddance).
That apple is a sucky company and should not be supported is very true, IMnsHO, and so are nvidia, google, and also intel, amd, microsoft, oracle... Now, SGI and Cray on the other hand, those were pretty cool companies!
(And now I am concerned that IQ4_NL actually looks quite good, at least on apple :)
That Q4_1 is computationally efficient to quantize, comparatively, also makes this easier.
That size doesn't matter much on apple is surprising to me - yes, memory bandwidth is higher, but Q4_1 is simple enough for it to still be memory-bandwidth-limited, one would naively assume.
So long story short, I've added Q4_1 to the list of imatrix quants to generate. It's not ideal, because I am still worried that good Q4_1 results might be flukes, but barring even more data, this seems indeed the correct decision. I don't think I want a static Q4_1 (unless I only generate static quants for a model, but I do not have logic for that in place yet).
And if somebody wants to add data for more varied models, I wouldn't be unhappy :)
I hope that flies well with everybody.
And now I am concerned that IQ4_NL actually looks quite good, at least on apple :)
@mradermacher IQ4_NL is a relatively important one due to https://github.com/ggerganov/llama.cpp/pull/10541, which implements runtime repacking into IQ4_NL_4_4 for ARM NEON, superseding the never-merged IQ4_NL_X_X quants described in https://github.com/ggerganov/llama.cpp/pull/10196: "Q4_0_X_X is very fast but the accuracy of Q4_0 is not good. IQ4_NL is much better than Q4_0 and they have compatible structure. Therefore, I introduce IQ4_NL_X_X to have benefits of both." If you look at the performance results posted in https://github.com/ggerganov/llama.cpp/pull/10196 I have to say I'm really impressed with IQ4_NL_4_4. I recommend we provide weighted/imatrix IQ4_NL quants for small models similar to how we did for the ARM quants.
for small models
Your argument overall is persuasive, but indeed, for small models the barrier is much lower. I'll implement that as soon as I can.
IQ4_NL should be done, currently at <= 18B.