Is the Q2_K_XL model really the best? IQ2_XXS scores higher than Q2_K_XL on the MMLU-Pro benchmark.
Where can we get the MMLU-Pro benchmark results for Q2_K_XL and IQ2_XXS?
Without the data we can't tell. It could just be randomness, since we're running R1 at temperature 0.6 and it's doing reasoning.
Well, I only tested computer science (410 questions) zero-shot, following the usage recommendations on the DeepSeek-R1 page. The score for Q2_K_XL is 80.24, and the score for IQ2_XXS is 84.15.
The test takes a very long time, so I only ran it once per quant.
The following is an example of the first question and the corresponding response:
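For what it's worth, a single run over 410 questions carries nontrivial sampling noise. A rough binomial error estimate (a sketch I put together, not part of the original test harness) suggests the 80.24 vs 84.15 gap is below the usual significance threshold:

```python
import math

def binomial_se(p, n):
    """Standard error of an accuracy estimate over n independent questions."""
    return math.sqrt(p * (1 - p) / n)

n = 410          # MMLU-Pro computer science question count
p_q2 = 0.8024    # Q2_K_XL accuracy (single run)
p_iq2 = 0.8415   # IQ2_XXS accuracy (single run)

se_q2 = binomial_se(p_q2, n)
se_iq2 = binomial_se(p_iq2, n)
# Standard error of the difference, assuming the two runs are independent
se_diff = math.sqrt(se_q2**2 + se_iq2**2)
z = (p_iq2 - p_q2) / se_diff

print(f"SE(Q2_K_XL) = {se_q2:.3f}")   # ~0.020, i.e. about 2 points
print(f"SE(IQ2_XXS) = {se_iq2:.3f}")  # ~0.018
print(f"z-score of the gap = {z:.2f}")  # ~1.47, below the usual 1.96 cutoff
```

So with one run per quant, the 3.91-point gap is suggestive but not conclusive; repeated runs (as done later in this thread for the 32B distill) would be needed to pin it down.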
I remember something similar happening consistently back in the day with an old model, where a lower quant somehow got a question right that the Q8 didn't.
The MMLU-Pro computer science subject has 410 questions, so a few outliers have little impact on the final scores.
It's interesting for sure. For what it's worth, the perplexity scores don't seem to indicate a problem with the quant though. https://huggingface.co./unsloth/DeepSeek-R1-GGUF/discussions/37
Well, I tested the R1 distill Qwen-32B model on MMLU-Pro computer science, repeating the run 8 times. You can see that even though Q3_K_M has a larger model size and lower perplexity, its score is still worse than IQ3_XS (using the GGUF models provided by bartowski).
The IQ quant being better isn't a surprise, and the perplexity difference is within the margin of uncertainty. That said, if I hadn't already downloaded Q2_K_XL, I might have gone for IQ2_XXS to be faster and, at least in one test, better at computer science. I wonder what IQ3_M would score on MMLU-Pro computer science.
It turns out IQ2_XXS is quantized with an importance matrix (imatrix) and Q2_K_XL isn't. That might be the factor involved.
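For anyone wanting to reproduce an imatrix quant, the llama.cpp workflow looks roughly like this (a sketch; the model and calibration file names are placeholders, and flags may differ slightly across llama.cpp versions):

```shell
# Compute an importance matrix from a calibration text file
# (weights activation statistics so quantization preserves the most-used weights)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat

# Quantize with the imatrix applied
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq2_xxs.gguf IQ2_XXS
```

The IQ-series quants require an imatrix, while K-quants like Q2_K can be made with or without one, which would explain the asymmetry here.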
As for the IQ3_M provided by bartowski, I cannot use max_tokens = 8192 as the smaller models (e.g., IQ2_XXS) did, because it runs out of memory; thus the comparison with IQ3_M is a bit unfair.
The score of IQ3_M is around 80 (still running, unfinished) with max_tokens = 6300.
Therefore, I would say the IQ2_XXS provided by unsloth is pretty good.
Great stuff thanks for confirming! :)