General discussion.

#1
by Lewdiculous - opened

I can add the smaller GGUF quantizations for the lower-VRAM mates if that's of interest.

@Test157t - You got me thinking about imatrix. What are the steps to generate the imatrix.dat data?

Never mind, I think I figured it out; looks like it's:

imatrix.exe -m F16-model.gguf -f imatrix.txt -ngl <depends on your GPU VRAM, 0 to only use CPU and RAM>

It takes me 32 minutes on CPU and RAM only, and 21 minutes with -ngl 14 (14 layers offloaded to the GPU), for an F16 GGUF of your 7B model.

If you know of any optimizations for this process let me know.
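
Roughly, with an explicit output path, it looks like this; the -o flag is what I'm assuming from the imatrix help text, and the binary name may differ per build (imatrix.exe on Windows, ./imatrix elsewhere):

# Generate imatrix.dat from the F16 GGUF using a calibration text file.
# -ngl sets how many layers are offloaded to the GPU; 0 = CPU and RAM only.
imatrix.exe -m F16-model.gguf -f imatrix.txt -o imatrix.dat -ngl 14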

So when I make the imatrix.dat itself, I'll make it using a Q8_0 quant and offload all 33 layers to the GPU.
@Spacellary

The workflow is usually: convert to FP16 -> quantize the FP16 into Q8_0 -> use the Q8_0 to make the imatrix with -ngl 33 -> then quantize the remaining quants with the imatrix using the FP16 model.
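
Spelled out as commands, that's roughly the following; file names are placeholders and the exact binary names depend on your llama.cpp build:

# 1. Convert the HF model to an FP16 GGUF.
python convert.py ./model-dir --outtype f16 --outfile model-f16.gguf

# 2. Quantize the FP16 down to Q8_0 (only used for the imatrix pass).
quantize.exe model-f16.gguf model-q8_0.gguf Q8_0

# 3. Generate the importance matrix from the Q8_0 with full GPU offload.
imatrix.exe -m model-q8_0.gguf -f calibration.txt -o imatrix.dat -ngl 33

# 4. Quantize the remaining sizes from the FP16, passing in the imatrix.
quantize.exe --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M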

Added the new IQ3_S quant that was merged today. Not sure how useful it is, but it's good to see new improvements rolling in. All other quants are added as usual, up to Q5_K_M.

I think the new IQ3_S might actually help the folks with 6GB of VRAM.
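
For anyone following along, making that quant is just another quantize call with the imatrix passed in (same placeholder names as the workflow above; as I understand it, the i-quants are really meant to be made with an imatrix):

# Quantize the FP16 GGUF to the new IQ3_S type using the imatrix.
quantize.exe --imatrix imatrix.dat model-f16.gguf model-IQ3_S.gguf IQ3_S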

@Test157t - I see. So there isn't a significant precision loss from using the Q8_0 instead of the full F16 for generating the imatrix.dat, or is it just irrelevant in actual use?

@Lewdiculous - There is a drop in PPL in the final quant, but I haven't seen it reflect negatively in actual use. If I recall correctly, the difference was under 1.
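
If anyone wants to measure that themselves, the llama.cpp perplexity tool is the usual way. A rough sketch, with hypothetical file names and whatever eval text you prefer in place of wiki.test.raw:

# PPL of a quant whose imatrix was generated from the F16 model...
perplexity.exe -m model-Q4_K_M-f16-imx.gguf -f wiki.test.raw -ngl 33
# ...and of the same quant whose imatrix came from the Q8_0, then compare.
perplexity.exe -m model-Q4_K_M-q8-imx.gguf -f wiki.test.raw -ngl 33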

Aight, well, I'll definitely try to use the full F16 for that when I have ~20 minutes of my machine idling, and the Q8_0 when I don't, but I'll try to label the imatrix files accordingly.

Keep up the cool experiments!
