Quantizing ONNX FP32 to q4f16 for Web
The web has a model size limitation, and Phi-3.5 uses q4f16 to reduce the weight size. Is there any public framework that can do that?
It is pretty common to use 4-bit quantization for LLMs. I used this script, which takes care of it:
https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/builder.py
and under the hood it will use
https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/quantization
for the quantization.
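If you want to call that quantization tooling directly instead of going through builder.py, a minimal sketch looks like this. The paths and the block_size/is_symmetric values are placeholders, not the builder's exact settings, and the constructor arguments may differ by onnxruntime version:

import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# placeholder paths, not the builder's real layout
model = onnx.load("model_fp32.onnx")
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()  # rewrites the MatMul weights into 4-bit blocks
quantizer.model.save_model_to_file(
    "model_q4.onnx",
    use_external_data_format=True,  # keep the large weights in a .data file
)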
Thanks for the info. It seems that even using this builder.py I cannot make the external data small enough to fit the Chrome browser fetch limitation, and the model.onnx is only around 1 MB. Is there any specific parameter to make model.onnx bigger and the external data smaller? Thanks.
It seems I can change size_threshold to let bigger tensors go into the external data file. But even this way, I cannot get the same inference results as your original onnx_web model. Is there any specific setting for converting to web, such as builder.py xxx -e web? Thank you.
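(For reference, I assume size_threshold here maps to the external-data option when the model is saved, roughly like the following; the paths are just placeholders.)

import onnx

m = onnx.load("model_q4f16.onnx")
onnx.save_model(
    m,
    "model_q4f16_resaved.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model_q4f16.onnx.data",
    size_threshold=0,  # only tensors at least this many bytes are moved out to the .data file
    convert_attribute=False,
)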
I use this script to shape the external data:
https://github.com/guschmue/ort-web-perf/blob/master/onnx-chunk-external-data.py
and this script to cast the logits to fp32 so JavaScript does not need to deal with fp16:
https://github.com/guschmue/ort-web-perf/blob/master/onnx-wrap-fp16.py
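The wrapping idea is basically to rename the fp16 logits output and append a Cast back to fp32. A simplified sketch of that (not the script itself, and it assumes the output is named logits):

import onnx
from onnx import TensorProto, helper

m = onnx.load("model.onnx")  # placeholder path
g = m.graph
old_name, fp16_name = "logits", "logits_fp16"

# rewire the node that currently produces the fp16 logits
for node in g.node:
    node.output[:] = [fp16_name if o == old_name else o for o in node.output]

# cast fp16 -> fp32 and keep the public output name "logits"
g.node.append(helper.make_node("Cast", inputs=[fp16_name], outputs=[old_name], to=TensorProto.FLOAT))
for out in g.output:
    if out.name == old_name:
        out.type.tensor_type.elem_type = TensorProto.FLOAT

onnx.save_model(m, "model_fp32_logits.onnx", save_as_external_data=True, all_tensors_to_one_file=True)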
The entire thing looks like this:
root=$PWD
model=models/tjs/Phi-3.5-mini-instruct-onnx-web
# build the int4 web model with onnxruntime-genai's builder.py
python builder.py -m models/microsoft/Phi-3.5-mini-instruct -o $model -p int4 -e web
# clean up temp files and start from a fresh output dir
rm -rf /tmp/opt.* /tmp/model.onnx* $model/onnx
mkdir $model/onnx
# cast the fp16 logits output to fp32
python onnx/onnx-wrap-fp16.py --input $model/model.onnx --output /tmp/model.onnx --external_data --name logits
# repack the external data (chunk script from ort-web-perf)
python onnx/onnx-chunk-external-data.py --threshhold 1 --maxchunks 1 --input /tmp/model.onnx --output $model/onnx/model_q4f16.onnx
# copy the config/tokenizer json files next to the model
cp models/microsoft/Phi-3.5-mini-instruct/*.json $model/