Does this use imatrix?
Seems imatrix can improve quants, so does this use it? At 2 bits it sometimes acts a bit strange, as expected, but I have seen other quants of similarly sized models do better. I haven't tried 3 bits yet, though I might be able to squeeze it into 64 GB of RAM. Also, I think imatrix needs a calibration dataset, which should ideally cover all the languages this model supports well. But I imagine it would greatly help with cases like this even with a lazy dataset.
These are regular quants (without imatrix)
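For reference, producing imatrix quants with llama.cpp would look roughly like the sketch below. This is just an outline, assuming an F16 GGUF of the model and a calibration text file; the file names are placeholders, not what was used here.

```sh
# Build an importance matrix from a calibration text file
# (calibration.txt is a placeholder; ideally it covers the languages the model supports)
./imatrix -m ./models/command-r-plus-F16.gguf -f calibration.txt -o imatrix.dat

# Quantize using the importance matrix
./quantize --imatrix imatrix.dat \
    ./models/command-r-plus-F16.gguf \
    ./models/command-r-plus-Q2_K.gguf Q2_K
```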
Could you tell me which llama.cpp fork you are using and what the SHA-256 hash of the weights is? There was an issue with F16 token embeddings, and I would like to make sure this is not related to it.
Oddly, it's often periods that it struggles with, and it tends to pick similar words to replace them with. Ignore the broken fonts; that's an issue with my terminal and CJK languages, and I'm not sure what causes it.
Anyway, as for the checksum, it's 47e139a57872a72096c05b043b1ec6c2f08451da7df0d84d45168708667b98f5 ./models/command-r-plus-Q2_K.gguf
and I am running this PR: https://github.com/ggerganov/llama.cpp/pull/6491 at commit d2924073ee9bdd600d22ded4e2d5fe30e69783a7
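In case anyone wants to compare against the same setup, this is roughly how the hash and commit above can be reproduced; a sketch assuming a clone of ggerganov/llama.cpp with `origin` pointing at it (the local branch name is just an example):

```sh
# Verify the quantized weights
sha256sum ./models/command-r-plus-Q2_K.gguf

# Fetch the PR branch and check out the specific commit
git fetch origin pull/6491/head:pr-6491
git checkout d2924073ee9bdd600d22ded4e2d5fe30e69783a7
```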
Tried Q3, and it's a big improvement. I still have more spare memory than I thought I would.
I have some early perplexity results on wikitext-2-raw, and they seem to confirm this improvement:
| Test | PPL Value | Standard Deviation |
|---|---|---|
| Q2_K | 5.7178 | +/- 0.03418 |
| Q3_K_L | 4.6214 | +/- 0.02629 |
| Q4_K_M | 4.4625 | +/- 0.02522 |
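For reference, this kind of measurement can be run with llama.cpp's perplexity tool; a minimal sketch with placeholder paths, assuming the wikitext-2-raw test split is extracted locally:

```sh
# Measure perplexity of a quantized model on the wikitext-2-raw test set
./perplexity -m ./models/command-r-plus-Q3_K_L.gguf -f wikitext-2-raw/wiki.test.raw
```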