Quantize more models?

#3
by MiaoCata - opened

Great work! There's a fine-tuned model called DeepScaleR that has better performance. Could you quantize it with NexaQuant, starting from the original Q8_0?

I think Q8_0 and FP16 provide similar performance, so starting from Q8_0 may even be faster.
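For context, a minimal sketch of how one could fetch an existing Q8_0 GGUF as the starting point for quantization. The repo id and filename below are placeholders, not a confirmed DeepScaleR GGUF upload:

```python
# Minimal sketch: download an existing Q8_0 GGUF to use as quantization input.
# The repo id and filename are hypothetical -- swap in the actual DeepScaleR GGUF upload.
from huggingface_hub import hf_hub_download

q8_path = hf_hub_download(
    repo_id="your-org/DeepScaleR-1.5B-Preview-GGUF",  # hypothetical repo id
    filename="DeepScaleR-1.5B-Preview-Q8_0.gguf",     # hypothetical filename
)
print("Q8_0 GGUF downloaded to:", q8_path)
```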


Nexa AI org

@Losanti123 Thanks for bringing this up! Currently, we only support Q4_0, because standard 4-bit quantization is where an observable reasoning performance loss appears. In contrast, 8-bit quantization already has little performance loss compared to FP16, so there is much less to gain from applying NexaQuant at Q8_0.
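For anyone who wants to see the gap being described, here is a minimal sketch (assuming llama-cpp-python and local Q4_0/Q8_0 GGUF files of the same model; the file paths and prompt are placeholders) that runs one reasoning prompt through both quantizations so the outputs can be compared side by side:

```python
# Minimal sketch: run a single reasoning prompt through a Q4_0 and a Q8_0 GGUF
# of the same model and print both completions for a side-by-side check.
# File paths are placeholders; requires `pip install llama-cpp-python`.
from llama_cpp import Llama

PROMPT = "If a train travels 60 km in 45 minutes, what is its average speed in km/h? Think step by step."

def run(model_path: str) -> str:
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    out = llm(PROMPT, max_tokens=256, temperature=0.0)  # greedy decoding for a repeatable comparison
    return out["choices"][0]["text"]

for path in ["deepscaler-q4_0.gguf", "deepscaler-q8_0.gguf"]:  # hypothetical local files
    print(f"=== {path} ===")
    print(run(path))
```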
