Check out an alternate quantization...
https://huggingface.co./ZeroWw/NeuralDaredevil-8B-abliterated-GGUF
(you can find more like this in my profile)
These are my own quantizations (updated almost daily).
Output and embed tensors are quantized to f16.
All other tensors are quantized to q5_k or q6_k.
Result:
Both f16.q6 and f16.q5 are smaller than the standard q8_0 quantization,
and in my experience they perform as well as the pure f16.
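A rough size estimate helps show where the "smaller than q8_0" claim comes from. The sketch below is not from the original post. It assumes a Llama-3-8B-style shape (128256-token vocabulary, 4096 hidden size, about 8.03e9 parameters in total) and the nominal bits per weight of the llama.cpp block formats (16 for f16, 8.5 for q8_0, 6.5625 for q6_k, 5.5 for q5_k). The helper `estimate_gb` is a hypothetical name. Real files will differ somewhat, because the K-quant presets mix types across tensors and the file also carries metadata.

```shell
# Estimate model size in GB (10^9 bytes) for a mixed quant.
# $1 = parameters in the embed + output tensors
# $2 = parameters in all other tensors
# $3 = bits per weight used for embed + output
# $4 = bits per weight used for everything else
estimate_gb() {
    awk -v eo="$1" -v rest="$2" -v eo_bpw="$3" -v rest_bpw="$4" \
        'BEGIN { printf "%.1f\n", (eo * eo_bpw + rest * rest_bpw) / 8 / 1e9 }'
}

# Assumed parameter counts (Llama-3-8B-style, not stated in the thread):
# embed and output are each 128256 x 4096.
EO=1050673152
REST=6979588096

# Example calls:
#   estimate_gb "$EO" "$REST" 16 16        # pure f16
#   estimate_gb "$EO" "$REST" 8.5 8.5      # pure q8_0
#   estimate_gb "$EO" "$REST" 16 6.5625    # f16/q6_k
#   estimate_gb "$EO" "$REST" 16 5.5       # f16/q5_k
```

Under these assumptions the f16/q6_k mix comes out at roughly half the size of pure f16 and a little below pure q8_0, which matches the sizes described above.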
Hey thanks, can you elaborate on how you managed to improve the performance of these quants?
Sure: instead of quantizing everything the same way, I quantized the output and embed tensors to f16 and all the other tensors to q5_k, q6_k, or q8_0.
The f16/q6 is almost indistinguishable from the pure f16, and it's half as big :D
The f16/q5 is smaller still, and less degraded than a pure q5.
Obviously these quants are bigger than the "pure" ones, but the trade-off is great (imho).
If you check my profile, you will find all my models quantized this way.
https://huggingface.co./ZeroWw
Here's the command you use to quantize it in this way:
llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 {input_model_name}.gguf {output_model_name}.gguf Q5_K
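To produce all three mixes mentioned in this thread in one go, the same command can be wrapped in a small loop. This is only a sketch: the function names `quantize_one` and `quantize_all`, the `DRY_RUN` switch, and the `<model>.f16.<type>.gguf` naming are assumptions made for illustration, and it expects `llama-quantize` to be on your `PATH`. The flags are exactly the ones from the command above.

```shell
# Quantize one file, keeping output and token-embedding tensors at f16.
# $1 = input GGUF, $2 = output GGUF, $3 = type for all other tensors.
# If DRY_RUN is set to a non-empty value, the command is printed, not run.
quantize_one() {
    ${DRY_RUN:+echo} llama-quantize --allow-requantize \
        --output-tensor-type f16 --token-embedding-type f16 \
        "$1" "$2" "$3"
}

# Produce the f16/q5_k, f16/q6_k and f16/q8_0 variants of one model.
# $1 = model name without extension; the input is assumed to be "$1.gguf".
quantize_all() {
    for qtype in Q5_K Q6_K Q8_0; do
        # Lower-case the type for the output file name, e.g. Q6_K -> q6_k.
        suffix=$(printf '%s' "$qtype" | tr '[:upper:]' '[:lower:]')
        quantize_one "$1.gguf" "$1.f16.$suffix.gguf" "$qtype"
    done
}
```

Running `DRY_RUN=1; quantize_all mymodel` prints the three commands so you can check them first; unset `DRY_RUN` to actually run them.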
Yep, that's what I used and posted. q6_k is also great. q4_k will degrade the model too much imho, but it's still usable and obviously smaller.
Usually a 7B model quantized my way at f16/q6_k runs great on CPU-only devices...
You're welcome, but in your model card you wrote "GGUF (FP16)", while these are f16/q5_k, f16/q6_k, and f16/q8_0 (mixed quants... they still don't have a name) :D