from textwrap import dedent
BANNER_TEXT = """
<div style="text-align: center;">
<h1><a href='https://github.com/argmaxinc/WhisperKitAndroid'>WhisperKit Android Benchmarks</a></h1>
</div>
"""
INTRO_LABEL = """We present comprehensive benchmarks for WhisperKit Android, our on-device ASR solution for Android devices, compared against a reference implementation. These benchmarks aim to help developers and enterprises make informed decisions when choosing optimized or compressed variants of machine learning models for production use. Show more."""
INTRO_TEXT = """
<h3 style="display: flex;
justify-content: center;
align-items: center;
"></h3>
\n📈 Key Metrics:
Word Error Rate (WER) (⬇️): The percentage of words incorrectly transcribed. Lower is better.
Quality of Inference (QoI) (⬆️): Percentage of examples where WhisperKit Android performs no worse than the reference model. Higher is better.
Tokens per Second (⬆️): The number of output tokens generated per second. Higher is better.
Speed (⬆️): Input audio seconds transcribed per second. Higher is better.
🎯 WhisperKit Android is evaluated across different datasets, with a focus on per-example no-regressions (QoI) and overall accuracy (WER).
\n💻 Our benchmarks include:
Reference: <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> (OpenAI's Whisper API)
On-device: <a href='https://github.com/argmaxinc/WhisperKitAndroid'>WhisperKit Android</a> (various versions and optimizations)
ℹ️ Reference Implementation:
<a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> sets the reference standard. We assume it uses the equivalent of openai/whisper-large-v2 in float16 precision, along with additional undisclosed optimizations from OpenAI. As of 02/29/24, it costs $0.36 per hour of audio and has a 25MB file size limit per request.
\n🔍 We use two primary datasets:
<a href='https://huggingface.co./datasets/argmaxinc/librispeech'>LibriSpeech</a>: ~5 hours of short English audio clips
<a href='https://huggingface.co./datasets/argmaxinc/earnings22'>Earnings22</a>: ~120 hours of English audio from earnings calls
🔄 Results are periodically updated using our automated evaluation pipeline on Apple Silicon Macs.
\n🛠️ Developers can use <a href='https://github.com/argmaxinc/WhisperKitAndroid'>WhisperKit Android</a> to reproduce these results or run evaluations on their own custom datasets.
🔗 Links:
- <a href='https://github.com/argmaxinc/WhisperKitAndroid'>WhisperKit Android</a>
- <a href='https://github.com/argmaxinc/whisperkittools'>whisperkittools</a>
- <a href='https://huggingface.co./datasets/argmaxinc/librispeech'>LibriSpeech</a>
- <a href='https://huggingface.co./datasets/argmaxinc/earnings22'>Earnings22</a>
- <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a>
"""
METHODOLOGY_TEXT = dedent(
"""
# Methodology
## Overview
WhisperKit Android Benchmarks is the one-stop shop for on-device performance and quality testing of WhisperKit Android models across supported devices, OS versions and audio datasets.
## Metrics
- **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit Android latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
- **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
  - This metric varies with input data because the pace of speech changes the text decoder's share of overall latency. It should not be confused with the reciprocal of the text decoder latency, which is constant across input files.
- **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
- **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit Android performs no worse than the reference model.
  - This metric does not capture improvements to the reference. It only measures potential regressions.
## Data
- **Short-form**: 10 minutes of English audiobook clips with 30s/clip comprising a subset of the [librispeech test set](https://huggingface.co./datasets/argmaxinc/librispeech). Proxy for average streaming performance.
- **Long-form**: 10 minutes of earnings call recordings in English. Built from the [earnings22 test set](https://huggingface.co./datasets/argmaxinc/earnings22-12hours). Proxy for average from-file performance.
- Full datasets are used for English Quality tests and random 10-minute subsets are used for Performance tests.
## Performance Measurement
1. On-device testing is conducted with [WhisperKit Android Tests](https://github.com/argmaxinc/WhisperKitAndroid) on Android devices, across different Android versions.
2. Performance is recorded on the 10-minute datasets described above for short- and long-form audio.
3. Quality metrics are recorded on 10-minute datasets using an Apple M2 Pro CPU on a Linux host. This allows fast processing of many configurations and provides a consistent, high-performance baseline for all evaluations displayed in the English Quality tab.
4. Results are aggregated and presented in the dashboard, allowing for easy comparison and analysis.
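
The aggregation in step 4 is not specified in detail here. As a rough sketch only, assuming each run is a record with `model`, `device`, `os`, `speed` and `tokens_per_second` fields (a hypothetical layout, as is the `aggregate_runs` helper), runs could be grouped per configuration and averaged like this:

```python
from collections import defaultdict
from statistics import mean


def aggregate_runs(runs):
    # Group per-run records by (model, device, os). The record layout is an
    # assumption made for illustration, not the pipeline's actual schema.
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run["model"], run["device"], run["os"])].append(run)

    # Average each performance metric within a configuration.
    return {
        key: {
            "speed": mean(r["speed"] for r in rows),
            "tokens_per_second": mean(r["tokens_per_second"] for r in rows),
        }
        for key, rows in grouped.items()
    }


# Hypothetical records; the model, device and OS names are placeholders.
example_runs = [
    {"model": "example-model", "device": "example-phone", "os": "Android 14",
     "speed": 10.0, "tokens_per_second": 40.0},
    {"model": "example-model", "device": "example-phone", "os": "Android 14",
     "speed": 14.0, "tokens_per_second": 50.0},
]
summary = aggregate_runs(example_runs)
```
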
## Dashboard Features
- Performance: Interactive filtering by model, device, OS, and performance metrics
- Timeline: Visualizations of performance trends
- English Quality: English transcription quality on short- and long-form audio
- Device Support: Matrix of supported device, OS and model version combinations. Unsupported combinations are marked with :warning:.

This methodology ensures a comprehensive and fair evaluation of speech recognition models supported by WhisperKit Android across a wide range of scenarios and use cases.
"""
)
PERFORMANCE_TEXT = dedent(
"""
## Metrics
- **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit Android latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
- **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
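
As a minimal sketch of the two definitions above, assuming the audio length, the end-to-end latency and the number of text decoder forward passes are already measured for a run (the function and parameter names are illustrative, not part of the WhisperKit Android API):

```python
def speed_factor(audio_seconds: float, latency_seconds: float) -> float:
    # Seconds of input audio transcribed per second of end-to-end latency.
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return audio_seconds / latency_seconds


def tokens_per_second(decoder_forward_passes: int, latency_seconds: float) -> float:
    # Text decoder forward passes divided by the same end-to-end latency.
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return decoder_forward_passes / latency_seconds


# Hypothetical run: a 30 s clip transcribed in 2 s with 90 decoder passes.
print(speed_factor(30.0, 2.0))     # 15.0
print(tokens_per_second(90, 2.0))  # 45.0
```

Both values share the same denominator, so a clip with denser speech (more decoder passes for the same audio length) changes Tok/s even when the speed factor stays similar.
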
## Data
- **Short-form**: 10 minutes of English audiobook clips with 30s/clip comprising a subset of the [librispeech test set](https://huggingface.co./datasets/argmaxinc/librispeech).
- **Long-form**: 10 minutes of earnings call recordings in English with various accents. Built from the [earnings22 test set](https://huggingface.co./datasets/argmaxinc/earnings22-12hours).
"""
)
QUALITY_TEXT = dedent(
"""
## Metrics
- **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
- **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit Android performs no worse than the reference model.
  - This metric does not capture improvements to the reference. It only measures potential regressions.
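
A minimal sketch of both metrics follows. It assumes whitespace-tokenized, already normalized text, and it assumes that "no worse" is judged by comparing per-example WER. Neither assumption is confirmed here, and the function names are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    # Word-level edit distance (substitutions, insertions, deletions)
    # divided by the number of words in the reference transcription.
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def quality_of_inference(candidate_wers, reference_wers) -> float:
    # Fraction of examples where the candidate's WER is less than or equal
    # to the reference model's WER on the same example (assumed criterion).
    if not candidate_wers or len(candidate_wers) != len(reference_wers):
        raise ValueError("expected two non-empty lists of equal length")
    no_regression = sum(c <= r for c, r in zip(candidate_wers, reference_wers))
    return no_regression / len(candidate_wers)


print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
# Two of the three hypothetical examples below do not regress.
print(quality_of_inference([0.0, 0.2, 0.5], [0.1, 0.2, 0.4]))
```

Under this sketch, an example where the candidate beats the reference counts the same as a tie, which is why QoI only signals regressions.
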
"""
)
COL_NAMES = {
    "model.model_version": "Model",
    "device.product_name": "Device",
    "device.os": "OS",
    "average_wer": "Average WER",
    "qoi": "QoI",
    "speed": "Speed",
    "tokens_per_second": "Tok / s",
    "model": "Model",
    "device": "Device",
    "os": "OS",
    "english_wer": "English WER",
    "multilingual_wer": "Multilingual WER",
}
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{whisperkit-android-argmax,
title = {WhisperKit Android},
author = {Argmax, Inc.},
year = {2024},
URL = {https://github.com/argmaxinc/WhisperKitAndroid}
}"""
HEADER = """<div align="center">
<div style="position: relative;">
<img
src=""
style="display:block;width:7%;height:auto;"
/>
</div>
</div>"""
EARNINGS22_URL = (
    "https://huggingface.co./datasets/argmaxinc/earnings22-debug/resolve/main/{0}"
)
LIBRISPEECH_URL = (
    "https://huggingface.co./datasets/argmaxinc/librispeech-debug/resolve/main/{0}"
)
AUDIO_URL = (
    "https://huggingface.co./datasets/argmaxinc/whisperkit-test-data/resolve/main/"
)
WHISPER_OPEN_AI_LINK = "https://huggingface.co./datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/{}/{}"
BASE_WHISPERKIT_BENCHMARK_URL = "https://huggingface.co./datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data"