---
pretty_name: "WhisperKit ASR Evaluation Results"
tags:
- whisper
- whisperkit
- coreml
- asr
- quantized
---

# WhisperKit Evaluation Results

## Dataset: `librispeech`

### Quality Evaluation

| Model | WER | QoI (%) | File Size (MB) |
|:------|----:|--------:|---------------:|
| [WhisperOpenAIAPI/openai_whisper-large-v2](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2) | 2.85 | 100 | 3100 |
| [WhisperKit/openai_whisper-large-v2](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2) | 3.28 | 96.6 | 3100 |
| [WhisperKit/openai_whisper-large-v2_1050MB](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2_1050MB) | 3.32 | 95 | 1050 |
| [WhisperKit/openai_whisper-large-v2_turbo](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2_turbo) | 3.24 | 96.6 | 3100 |
| [WhisperKit/openai_whisper-large-v2_turbo_1022MB](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2_turbo_1022MB) | 3.33 | 94.9 | 1022 |
| [whisper.cpp/openai_whisper-large-v2-q5_0](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/whisper.cpp/openai_whisper-large-v2-q5_0) | 2.8 | 96.6 | 1080 |
| [WhisperKit/openai_whisper-small](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-small) | 3.98 | 82.9 | 483 |
| [WhisperKit/openai_whisper-base](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-base) | 6.11 | 67.1 | 145 |
| [WhisperKit/openai_whisper-tiny](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-tiny) | 8.94 | 52.4 | 66 |
| [WhisperKit/openai_whisper-large-v3](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v3) | 2.48 | 95.2 | 3100 |
| [WhisperKit/openai_whisper-large-v3_turbo](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v3_turbo) | 2.44 | 95.4 | 3100 |
| [openai_whisper-large-v3_turbo_1018MB](https://huggingface.co./argmaxinc/whisperkit-coreml-staging/tree/main/openai_whisper-large-v3_turbo_1018MB) | 2.49 | 94.8 | 1018 |

### Quality-of-Inference (QoI) Certification
We believe that rigorously measuring the quality of inference is necessary for developers and
enterprises to make informed decisions when opting to use optimized or compressed variants of
Whisper models in production. The current measurements compare reference and optimized
WhisperKit models. We will soon extend the scope of this measurement to other Whisper
implementations so that developers can certify the behavior change (if any) caused by using
WhisperKit alongside (or migrating from) these implementations.

In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below),
which is a stricter metric than dataset-average WER. A 100% `qoi` preserves perfect
backwards-compatibility on the test distribution and avoids "perceived regressions": the phenomenon
where known per-example behavior changes after a code or model update and causes divergence in
downstream code or breaks the user experience itself (even if dataset averages stay flat
across updates). Pseudocode for `qoi`:

```python
qoi = []
for example in dataset:
    # Per-example no-regression check: the optimized model must not have a
    # higher WER than the reference model on the same example
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100.
```
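
For a runnable variant of the same metric, the sketch below assumes you already have the ground-truth
transcripts and the outputs of both models as parallel lists of strings; the function and argument names
are illustrative (they are not part of whisperkittools) and per-example WER is computed with the `jiwer` package:

```python
import jiwer  # pip install jiwer


def compute_qoi(references, reference_outputs, optimized_outputs):
    """Percent of examples where the optimized model does not regress
    (by per-example WER) relative to the reference model."""
    no_regressions = [
        jiwer.wer(truth, optimized) <= jiwer.wer(truth, reference)
        for truth, reference, optimized in zip(references, reference_outputs, optimized_outputs)
    ]
    return 100.0 * sum(no_regressions) / len(no_regressions)
```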

We define the reference model as the default float16-precision Core ML model generated by
whisperkittools. This reference model matches the accuracy of the original PyTorch model
on the specified test sets. We use `librispeech/test.clean` (5 hours of short English audio clips)
as our testing set for Whisper. We are actively expanding our test set coverage to `earnings22`
(120 hours of long English audio clips with various accents). We anticipate that developers who use
Whisper in production will have their own Quality Assurance test sets, and whisperkittools offers the
tooling necessary to run the same measurements on such custom test sets; please see
[Model Evaluation on Custom Dataset](#evaluate-on-custom-dataset) for details.
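
As a rough illustration of what such a custom QA measurement can look like outside of whisperkittools
(this is not its evaluation pipeline), the sketch below loads the same `librispeech/test.clean` split from
the Hugging Face Hub and computes a dataset-level WER for an arbitrary `transcribe` callable; the dataset
identifier, the callable, and the minimal text normalization are assumptions you may need to adapt:

```python
from datasets import load_dataset  # pip install datasets jiwer soundfile
import jiwer


def evaluate_wer(transcribe):
    """`transcribe` is a placeholder callable mapping a 16 kHz audio array to text."""
    dataset = load_dataset("librispeech_asr", "clean", split="test")
    references, hypotheses = [], []
    for example in dataset:
        # LibriSpeech references are uppercase without punctuation, so lowercase
        # both sides as a minimal normalization before scoring.
        references.append(example["text"].lower())
        hypotheses.append(transcribe(example["audio"]["array"]).lower())
    return jiwer.wer(references, hypotheses) * 100.0
```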

### Reproducing Results
Results on this page are generated by our cluster of Apple Silicon Macs, which we use as self-hosted runners on
GitHub Actions for our CI infrastructure. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners),
we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to
run identical [evaluation jobs](#evaluation) locally. For reference, our M2 Ultra devices complete a `librispeech` + `openai/whisper-large-v3`
evaluation in under 1 hour regardless of the Whisper implementation. Older Apple Silicon Macs should take less than
1 day to complete the same evaluation.

Glossary:

- `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription
as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).

- `_*MB`: Indicates the presence of mixed-bit quantization. Instead of cluttering the filename with details like
`_AudioEncoder-5.8bits_TextDecoder-6.1bits`, we choose to summarize the compression spec as the resulting total file size since this is what matters to developers in production.