Is the Q2_K_XL model really the best? IQ2_XXS scores higher than Q2_K_XL on the MMLU-Pro benchmark.
Where can we get the MMLU-Pro benchmark results for Q2_K_XL and IQ2_XXS?
Without the data we can't tell. It could just be randomness, since we're running R1 at temperature 0.6 and it's doing reasoning.
Well, I only tested computer science (410 questions) zero-shot, following the usage recommendations on the DeepSeek-R1 page. The score for Q2_K_XL is 80.24, and the score for IQ2_XXS is 84.15.
The test takes a very long time, so I only ran it once per quant.
The following is an example of the first question and the corresponding response:
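For what it's worth, a single run over 410 questions carries nontrivial sampling noise. A rough binomial error estimate (a sketch I put together, not part of the original test harness) suggests the 80.24 vs 84.15 gap is below the usual significance threshold:

```python
import math

def binomial_se(p, n):
    """Standard error of an accuracy estimate over n independent questions."""
    return math.sqrt(p * (1 - p) / n)

n = 410          # MMLU-Pro computer science question count
p_q2 = 0.8024    # Q2_K_XL accuracy (single run)
p_iq2 = 0.8415   # IQ2_XXS accuracy (single run)

se_q2 = binomial_se(p_q2, n)
se_iq2 = binomial_se(p_iq2, n)
# Standard error of the difference, assuming the two runs are independent
se_diff = math.sqrt(se_q2**2 + se_iq2**2)
z = (p_iq2 - p_q2) / se_diff

print(f"SE(Q2_K_XL) = {se_q2:.3f}")   # ~0.020, i.e. about 2 points
print(f"SE(IQ2_XXS) = {se_iq2:.3f}")  # ~0.018
print(f"z-score of the gap = {z:.2f}")  # ~1.47, below the usual 1.96 cutoff
```

So with one run per quant, the 3.91-point gap is suggestive but not conclusive; repeated runs (as done later in this thread for the 32B distill) would be needed to pin it down.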
I remember something similar happening consistently back in the day with an old model, where a lower quant somehow got a question right that the Q8 didn't.
The MMLU-Pro computer science subject has 410 questions, so a few outliers have little impact on the final scores.
It's interesting for sure. For what it's worth, the perplexity scores don't seem to indicate a problem with the quant though. https://huggingface.co./unsloth/DeepSeek-R1-GGUF/discussions/37
Well, I tested the R1 distill Qwen-32B model on MMLU-Pro computer science, repeating the run 8 times. You can see that even though Q3_K_M has a larger model size and lower perplexity, its score is still worse than IQ3_XS (using the GGUF models provided by bartowski).
The IQ quant being better isn't a surprise, and the perplexity difference is within the margin of uncertainty. That said, if I hadn't already downloaded Q2_K_XL, I might have gone for IQ2_XXS to be faster and, at least in one test, better at computer science. I wonder what IQ3_M would score on MMLU-Pro computer science.
It turns out IQ2_XXS is quantized with an importance matrix (imatrix) and Q2_K_XL isn't. That might be the factor involved.
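For anyone wanting to reproduce an imatrix quant, the llama.cpp workflow looks roughly like this (a sketch; the model and calibration file names are placeholders, and flags may differ slightly across llama.cpp versions):

```shell
# Compute an importance matrix from a calibration text file
# (weights activation statistics so quantization preserves the most-used weights)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat

# Quantize with the imatrix applied
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq2_xxs.gguf IQ2_XXS
```

The IQ-series quants require an imatrix, while K-quants like Q2_K can be made with or without one, which would explain the asymmetry here.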
As for the IQ3_M provided by bartowski, I cannot use max_tokens = 8192 as the smaller models (e.g., IQ2_XXS) did, because it runs out of memory; thus the comparison with IQ3_M is a bit unfair.
The score of IQ3_M is around 80 (still running, unfinished) with max_tokens = 6300.
Therefore, I would say the IQ2_XXS provided by unsloth is pretty good.
Great stuff thanks for confirming! :)