oh, yeah, of course.. I added all the ARM quants but then not Q4_0 which is now the only one that would work haha..
I'll go and make a Q4_0 for it I suppose! Just this once
Don't love adding more formats but if your results are accurate it does seem worth including
I've updated it to "Legacy format, offers online repacking for ARM and AVX CPU inference." It's still a legacy format overall, but with the online repacking it's worth considering for speed
I'm hoping that IQ4_NL gets a few more packing options in the near future
Hell yeah. I wish we could still repack offline; I get why it's not sustainable going forward, but until there's better support and more options it would be nice to keep it around
Oh right, sorry, I forgot to include that PR. I'll add it above, but it's here:
https://github.com/ggerganov/llama.cpp/pull/10541
I think the inference engines will just need to update to the newer versions and they'll get the repacking logic for free, if that's what you meant then yes
This makes perfect sense, average users definitely don't need to be uploading that much stuff privately, great for testing but if it's not worth releasing publicly it's not worth storing on servers for free :)
Great update!
For what it's worth, it seems like these "limits" always existed and are only now public; they've always let people blow through them and have given grants to accounts that were contributing to the community
You can read VB's response on Reddit here:
But TL;DR: don't worry about it, this shouldn't interfere with anyone who's using the platform legitimately
The test mark was added after the initial upload and after people pointed it out :) glad it's a good label though
This argument really doesn't make any sense to me.. surely if you're aiming for the most accurate overall representation, anyone can see that gathering as many data points across as diverse an area as possible would yield the most useful results? Sure, ideally your single light will probably land on a reasonably close overall value.. but it also might not?
Additionally, I think his point was that you don't necessarily want to increase performance against a given corpus, but rather increase faithfulness to the original model against a given corpus
You may be able to keep PPL the same as or better than the original while simultaneously veering far from what the original model would have generated. That's great for that corpus of text, but it's not what the quantization is meant to achieve (in fact many people worry about this a lot, fearing that the quantization will favour the text used as a reference, which I'm luckily seeing is not what happens, at least with imatrix)
The fact that 2 models can have identical PPL scores yet generate completely different text should be proof enough that PPL only tells a tiny part of the story. Yes, it's good to know the model is good, but when quantizing I don't need to know how good it is, I need to know how similar it is to the original.
I suppose that's reasonable. I guess the reason I like KLD more is that it breaks things down into statistics like mean, max, the 99.99% percentile, etc., whereas PPL is just a single all-encompassing number that's more difficult to interpret
I don't know if I can put much value into IQ6 outperforming fp16, because lately we've been seeing benchmarks where Q3 beats bf16. So while benchmarks are useful, I don't think they can definitively tell us quant quality, though I do think they're a good proof of competency
This is why, to me, KLD provides at least a slightly clearer picture of how well the quantization does at recreating the original model. I still see what you're saying about PPL, but (at least the way llama.cpp does it) KLD gives a more thorough look. That, and the same-top-p stat is nice for seeing how often the two models agree on the top token
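To make the distinction concrete, here's a rough numpy sketch (my own illustration, not llama.cpp's actual implementation) of why the two metrics answer different questions: PPL only scores the probability each model assigns to the true next token, while KLD and the same-top-p stat compare the two models' full distributions against each other:

```python
# Rough sketch (NOT llama.cpp's code): PPL vs KLD-style stats over the same text.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ppl(logits, true_ids):
    """Perplexity: exp of the mean negative log-prob assigned to the true tokens."""
    p = softmax(logits)
    nll = -np.log(p[np.arange(len(true_ids)), true_ids])
    return float(np.exp(nll.mean()))

def kld_stats(logits_ref, logits_quant):
    """Per-token KL(ref || quant), summarized as mean / max / high percentile,
    plus how often the two models pick the same top token."""
    p = softmax(logits_ref)
    q = softmax(logits_quant)
    kld = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    same_top = (logits_ref.argmax(-1) == logits_quant.argmax(-1)).mean()
    return {
        "mean_kld": float(kld.mean()),
        "max_kld": float(kld.max()),
        "kld_p99.99": float(np.percentile(kld, 99.99)),
        "same_top_p": float(same_top),
    }

# Toy data with made-up logits, just to show the two metrics measure different things:
# PPL only looks at one column per token, KLD looks at the whole distribution.
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 128))                    # "original" per-token logits
quant = ref + rng.normal(scale=0.5, size=ref.shape)   # "quantized" per-token logits
true_ids = rng.integers(0, 128, size=1000)
print("PPL ref:", ppl(ref, true_ids), "PPL quant:", ppl(quant, true_ids))
print(kld_stats(ref, quant))
```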
That's not an invalid point, but also when the final goal is quantization that 0.03% is negligible compared to the rest of the losses.
If you're talking about running at full precision, yeah, bf16 > fp16 by all means
I'd also prefer to see KLD of fp16 vs bf16 since PPL is, to me, pretty meaningless. I'm sure it has value and probably more than I give it, but unless it's PPL against the dataset it was trained on I don't really find much merit to it.
I appreciate the breakdown though, and even 0.4% is not enough to worry me when again the final goal is quantization, not to run it at that DTYPE.
To that end, do you happen to know whether, when quantizing from BF16, it gets converted to FP16 first? Does it even matter? BF16 -> Q8 vs BF16 -> FP16 -> Q8, I wonder how different it would be. Gut instinct says it's in the 0.01% range.
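If anyone wants to eyeball that themselves, here's a rough numpy sketch (my own toy, not llama.cpp's code path) that simulates bf16 weights and compares a direct 8-bit absmax quantization against one that detours through fp16 first; the Q8_0-style per-block scaling is an assumption for illustration:

```python
# Rough sketch (NOT llama.cpp's code): BF16 -> Q8 vs BF16 -> FP16 -> Q8.
import numpy as np

def to_bf16(x):
    """Simulate bf16 by zeroing the low 16 bits of the fp32 representation
    (truncation; real bf16 conversion usually rounds, close enough here)."""
    u = x.astype(np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

def q8_roundtrip(x, block=32):
    """Q8_0-style quantization sketch: per-block absmax scale, int8 values, dequantized back."""
    x = x.astype(np.float32).reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -127, 127)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = to_bf16(rng.normal(scale=0.02, size=1 << 20).astype(np.float32))  # fake bf16 weights

direct = q8_roundtrip(w)                        # BF16 -> Q8
via_fp16 = q8_roundtrip(w.astype(np.float16))   # BF16 -> FP16 -> Q8

rms = lambda a, b: float(np.sqrt(np.mean((a - b) ** 2)))
print("rms error, direct  :", rms(w, direct))
print("rms error, via fp16:", rms(w, via_fp16))
print("weights changed by the fp16 detour:",
      float(np.mean(w.astype(np.float16).astype(np.float32) != w)))
```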
BF16 can't be offloaded to GPUs, so the imatrix becomes slow to make :')
I suppose I should add that this is more valuable as a pseudo-comparison to bf16
Since bf16 can represent a much wider dynamic range than fp16 (its exponent matches fp32's, so it can hold values far smaller than fp16's ~6e-5 smallest normal), there is much debate as to whether it's safe to convert from bf16 to fp16, whether you should keep bf16, or even upcast to fp32, in order to preserve the original quality of the model for as long as possible before quantizing to 8 bits
This test shows that fp16 is capable of representing 99.97% of the weights in an FP32 model (everything outside the underflow range), so the difference is negligible at best
Additionally, the weights fp16 can't represent are those between -6e-5 and 6e-5 (below fp16's smallest normal value); they are so small that they most likely do not contribute to the final output of the model and are relatively safe to prune
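For anyone who wants to run the same kind of check on their own tensors, here's a small numpy sketch (illustrative only, with made-up weights; the threshold is fp16's smallest normal, ~6.1e-5) that reports how much of an fp32 tensor falls below fp16's normal range and how much survives an fp32 -> fp16 round trip exactly:

```python
# Quick sketch for this kind of check; run it on real model tensors to get real numbers.
import numpy as np

FP16_MIN_NORMAL = np.finfo(np.float16).tiny  # ~6.1e-5, fp16's smallest normal value

def fp16_coverage(w_fp32):
    w = w_fp32.astype(np.float32).ravel()
    nonzero = w != 0
    too_small = nonzero & (np.abs(w) < FP16_MIN_NORMAL)   # underflow / subnormal territory
    roundtrip = w.astype(np.float16).astype(np.float32)
    return {
        "below_fp16_normal_range": float(too_small.mean()),
        "exact_after_fp16_roundtrip": float((roundtrip == w).mean()),
        "max_abs_roundtrip_error": float(np.abs(roundtrip - w).max()),
    }

# Fake "weights" just to show the shape of the output; real checkpoints will of
# course give different numbers (e.g. the ~0.03% discussed above).
rng = np.random.default_rng(0)
print(fp16_coverage(rng.normal(scale=0.02, size=1 << 20).astype(np.float32)))
```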