Quantize more models? #3
opened by MiaoCata
Great work! There's a fine-tuned model called DeepScaleR that performs even better. Could you quantize it with NexaQuant, starting from the original Q8_0?
I think Q8_0 and FP16 deliver similar performance, so starting from Q8_0 may even be faster.
@Losanti123 Thanks for bringing this up! Currently, we only support Q4_0, because that is where standard quantization causes an observable loss in reasoning performance. 8-bit quantization, by contrast, loses very little compared to FP16, so it doesn't benefit much from NexaQuant.
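For anyone curious why the gap matters, here's a minimal sketch (plain NumPy, not NexaQuant or the actual GGUF block formats) that quantizes a random weight tensor with symmetric round-to-nearest at 4-bit and 8-bit and compares the reconstruction error; the tensor shape and the per-tensor scheme are illustrative assumptions only.

```python
import numpy as np

# Toy sketch (not NexaQuant): simulate symmetric round-to-nearest
# quantization at 4-bit and 8-bit to show why 8-bit stays close to
# FP16 while 4-bit loses noticeably more precision.

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)  # stand-in weight tensor

def quantize_dequantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantization, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

for bits in (4, 8):
    recon = quantize_dequantize(weights, bits)
    rel_err = np.linalg.norm(weights - recon) / np.linalg.norm(weights)
    print(f"{bits}-bit relative reconstruction error: {rel_err:.4%}")

# The 4-bit error comes out roughly an order of magnitude larger than
# the 8-bit error, which is why plain Q4_0 can visibly hurt reasoning
# while Q8_0 stays near FP16 quality.
```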