Add merged and quantized ONNX model files

#6
by petewarden - opened

Based on the great work by @Xenova in https://github.com/usefulsensors/moonshine/pull/73, I adjusted the quantized versions to make them more accurate.

There are three versions of each of the "tiny" and "base" models:

  • Float: Unquantized float32 models, from @Xenova's original PR.
  • Quantized: 8-bit weights and activations, converted using dynamic quantization (weights stored as int8, activations quantized at runtime); see the sketch after this list.
  • Quantized 4-bit: Lower-precision versions using 4-bit weights for the MatMul ops only.
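
As a rough illustration of what the 8-bit scheme does, here is a minimal sketch using ONNX Runtime's quantize_dynamic, which implements the same weights-static, activations-dynamic idea. The file names are placeholders, and this is not necessarily the exact code path that shrink.py takes:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="tiny/float/encoder_model.onnx",    # placeholder file name
    model_output="tiny/quantized/encoder_model.onnx",
    weight_type=QuantType.QInt8,  # int8 weights; activations quantized at runtime
    nodes_to_exclude=["/conv1/Conv", "/conv2/Conv", "/conv3/Conv"],
)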

The 8-bit quantized versions were created using the ONNX shrink ray tool with these commands:

python3 src/onnx_shrink_ray/shrink.py --output_suffix ".onnx" --output_dir tiny/quantized_temp --method "integer_activations" --nodes_to_exclude "/conv1/Conv,/conv2/Conv,/conv3/Conv" tiny/float
python3 src/onnx_shrink_ray/shrink.py --method "integer_weights" --output_suffix ".onnx" --output_dir tiny/quantized tiny/quantized_temp

The second command is needed because the first pass excludes the conv ops from activation quantization, leaving their weights in float32; it shrinks the file size further by converting those float32 conv weights into int8 equivalents.
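
A quick way to check that the weight pass worked is to count initializer data types with the onnx Python package; a minimal sketch, with a placeholder path:

from collections import Counter

import onnx

# Placeholder path; pick any of the quantized model files.
m = onnx.load("tiny/quantized/encoder_model.onnx")
counts = Counter(
    onnx.TensorProto.DataType.Name(init.data_type) for init in m.graph.initializer
)
print(counts)  # after the weight pass, the conv weights should show up as INT8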

The 4-bit versions were created using:

python3 src/onnx_shrink_ray/shrink.py --output_suffix ".onnx" --output_dir tiny/quantized_4bit --method "integer_activations" --nodes_to_exclude "/conv1/Conv,/conv2/Conv,/conv3/Conv" tiny/float
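
For comparison, ONNX Runtime also ships a MatMul-only 4-bit weight quantizer that does block-wise int4 quantization of MatMul weights. This is a sketch of that route, not necessarily what shrink.py does internally; the block size and paths are assumptions:

import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("tiny/float/encoder_model.onnx")  # placeholder path
quant = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)  # assumed settings
quant.process()  # rewrites MatMul weights as block-wise 4-bit tensors
quant.model.save_model_to_file("tiny/quantized_4bit/encoder_model.onnx")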

Here are the word error rates (WER, lower is better) on the LibriSpeech clean English dataset:

Model  Quantization  WER    Total file size
Tiny   None          4.51%  133 MB
Tiny   8-bit         4.75%   26 MB
Tiny   4-bit         4.54%   44 MB
Base   None          3.29%  235 MB
Base   8-bit         3.30%   59 MB
Base   4-bit         3.35%   70 MB
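
To reproduce numbers like these, WER can be computed with the jiwer package once you have reference transcripts and the model's decoded output; a minimal sketch, with placeholder strings:

import jiwer

# Placeholder strings; in practice the references are the LibriSpeech clean
# transcripts and the hypotheses are the ONNX model's decoded output.
references = ["he hoped there would be stew for dinner"]
hypotheses = ["he hoped there would be stew for dinner"]
print(f"WER: {jiwer.wer(references, hypotheses):.2%}")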

The 8-bit quantized files are considerably smaller than the float32 versions, for a small loss in accuracy. I don't yet have inference speed numbers.

petewarden changed pull request status to open
petewarden changed pull request status to merged
