Add merged and quantized ONNX model files
#6
by petewarden · opened
Based on the great work by @Xenova in https://github.com/usefulsensors/moonshine/pull/73, I adjusted the quantized versions to make them more accurate.
There are three versions of each of the "tiny" and "base" models:
- Float: Unquantized float32 models, from @Xenova's original PR.
- Quantized: 8-bit weights and activations, converted using `dynamic_quantization()` (see the sketch after this list).
- Quantized 4-bit: Lower-precision version using 4-bit weights for the MatMul ops only.
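For readers unfamiliar with dynamic quantization, here is a minimal sketch using ONNX Runtime's stock `quantize_dynamic` entry point. It only illustrates the general technique; the models in this PR were produced with the ONNX Shrink Ray commands shown below, and the file paths here are hypothetical.

```python
# Minimal sketch of 8-bit dynamic quantization with ONNX Runtime's standard API.
# The models in this PR were actually produced with onnx_shrink_ray (commands
# below), whose internals may differ; the paths here are placeholders.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="tiny/float/decoder_model_merged.onnx",    # hypothetical input path
    model_output="tiny/quantized/decoder_model_merged.onnx",
    weight_type=QuantType.QInt8,  # weights stored as int8; activations quantized at runtime
)
```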
The 8-bit quantized versions were created using the ONNX Shrink Ray tool with these commands:
python3 src/onnx_shrink_ray/shrink.py --output_suffix ".onnx" --output_dir tiny/quantized_temp --method "integer_activations" --nodes_to_exclude "/conv1/Conv,/conv2/Conv,/conv3/Conv" tiny/float
python3 src/onnx_shrink_ray/shrink.py --method "integer_weights" --output_suffix ".onnx" --output_dir tiny/quantized tiny/quantized_temp
The second command is needed to shrink the file size further by converting the float32 conv weights, which the first pass leaves untouched via `--nodes_to_exclude`, into int8 equivalents.
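To make that weight conversion concrete, here is a toy numpy sketch of symmetric weight-only int8 quantization: store int8 values plus a float scale, and reconstruct w ≈ scale * q at load time. This illustrates the general idea rather than the tool's actual implementation, and the weight shape below is made up.

```python
# Illustration (not the shrink_ray implementation) of weight-only int8 quantization.
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization of a float32 weight array."""
    scale = np.max(np.abs(w)) / 127.0 if np.any(w) else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

w = np.random.randn(288, 288).astype(np.float32)   # made-up weight shape
q, scale = quantize_weights_int8(w)
w_hat = q.astype(np.float32) * scale               # dequantized approximation
print("max abs error:", np.max(np.abs(w - w_hat)))
print("size: float32 %d bytes -> int8 %d bytes" % (w.nbytes, q.nbytes))
```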
The 4-bit versions were created using:
python3 src/onnx_shrink_ray/shrink.py --output_suffix ".onnx" --output_dir tiny/quantized_4bit --method "integer_activations" --nodes_to_exclude "/conv1/Conv,/conv2/Conv,/conv3/Conv" tiny/float
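The storage win from 4-bit weights comes from packing two values into each byte, roughly halving weight storage relative to int8. The toy numpy sketch below shows that packing and unpacking round-trip correctly; it is only an illustration of the layout, not how the tool or the ONNX Runtime MatMul kernels actually store the data.

```python
# Toy sketch of 4-bit weight packing: two values in [-8, 7] per byte.
# Illustration only, not the tool's implementation.
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack an even-length array of values in [-8, 7] into uint8 pairs."""
    u = (q.astype(np.int8) & 0x0F).astype(np.uint8)   # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi
    return np.where(q > 7, q - 16, q)                 # sign-extend 4-bit values

q = np.random.randint(-8, 8, size=1024, dtype=np.int8)
packed = pack_int4(q)
assert np.array_equal(unpack_int4(packed), q)
print("int8 bytes:", q.nbytes, "-> packed 4-bit bytes:", packed.nbytes)
```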
Here are the accuracy numbers (word error rate, lower is better) from the LibriSpeech clean English dataset:

| Model | Quantization | WER | Total file size |
|---|---|---|---|
| Tiny | None | 4.51% | 133MB |
| Tiny | 8-bit | 4.75% | 26MB |
| Tiny | 4-bit | 4.54% | 44MB |
| Base | None | 3.29% | 235MB |
| Base | 8-bit | 3.30% | 59MB |
| Base | 4-bit | 3.35% | 70MB |
The 8-bit quantized files are considerably smaller than the float32 versions, for a small loss in accuracy. I don't yet have inference numbers.
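For anyone wanting to reproduce the table, word error rate can be computed from reference and predicted transcripts with the jiwer package. The evaluation script behind these numbers isn't part of this PR, so the transcripts below are just placeholders.

```python
# Minimal WER computation sketch using the jiwer package. The actual evaluation
# pipeline behind the table above isn't included in this PR; the transcript
# lists below are hypothetical placeholders.
import jiwer

references = [
    "he hoped there would be stew for dinner",
    "turnips and carrots and bruised potatoes",
]
hypotheses = [
    "he hoped there would be stew for dinner",
    "turnips and carrots and bruised potatoes and fat",
]

wer = jiwer.wer(references, hypotheses)   # fraction of word errors, lower is better
print(f"WER: {wer:.2%}")
```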
petewarden changed pull request status to open
petewarden changed pull request status to merged