{ "cells": [ { "cell_type": "markdown", "id": "75b58048-7d14-4fc6-8085-1fc08c81b4a6", "metadata": { "id": "75b58048-7d14-4fc6-8085-1fc08c81b4a6" }, "source": [ "# Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers" ] }, { "cell_type": "markdown", "id": "fbfa8ad5-4cdc-4512-9058-836cbbf65e1a", "metadata": { "id": "fbfa8ad5-4cdc-4512-9058-836cbbf65e1a" }, "source": [ "In this Colab, we present a step-by-step guide on how to fine-tune Whisper \n", "for any multilingual ASR dataset using Hugging Face 🤗 Transformers. This is a \n", "more \"hands-on\" version of the accompanying [blog post](https://huggingface.co./blog/fine-tune-whisper). \n", "For a more in-depth explanation of Whisper, the Common Voice dataset and the theory behind fine-tuning, the reader is advised to refer to the blog post." ] }, { "cell_type": "markdown", "id": "afe0d503-ae4e-4aa7-9af4-dbcba52db41e", "metadata": { "id": "afe0d503-ae4e-4aa7-9af4-dbcba52db41e" }, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "id": "9ae91ed4-9c3e-4ade-938e-f4c2dcfbfdc0", "metadata": { "id": "9ae91ed4-9c3e-4ade-938e-f4c2dcfbfdc0" }, "source": [ "Whisper is a pre-trained model for automatic speech recognition (ASR) \n", "published in [September 2022](https://openai.com/blog/whisper/) by the authors \n", "Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as \n", "[Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained \n", "on un-labelled audio data, Whisper is pre-trained on a vast quantity of \n", "**labelled** audio-transcription data, 680,000 hours to be precise. \n", "This is an order of magnitude more data than the un-labelled audio data used \n", "to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this \n", "pre-training data is multilingual ASR data. This results in checkpoints \n", "that can be applied to over 96 languages, many of which are considered \n", "_low-resource_.\n", "\n", "When scaled to 680,000 hours of labelled pre-training data, Whisper models \n", "demonstrate a strong ability to generalise to many datasets and domains.\n", "The pre-trained checkpoints achieve competitive results to state-of-the-art \n", "ASR systems, with near 3% word error rate (WER) on the test-clean subset of \n", "LibriSpeech ASR and a new state-of-the-art on TED-LIUM with 4.7% WER (_c.f._ \n", "Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).\n", "The extensive multilingual ASR knowledge acquired by Whisper during pre-training \n", "can be leveraged for other low-resource languages; through fine-tuning, the \n", "pre-trained checkpoints can be adapted for specific datasets and languages \n", "to further improve upon these results. We'll show just how Whisper can be fine-tuned \n", "for low-resource languages in this Colab." ] }, { "cell_type": "markdown", "id": "e59b91d6-be24-4b5e-bb38-4977ea143a72", "metadata": { "id": "e59b91d6-be24-4b5e-bb38-4977ea143a72" }, "source": [ "
\n", "\"Trulli\"\n", "
Figure 1: Whisper model. The architecture \n", "follows the standard Transformer-based encoder-decoder model. A \n", "log-Mel spectrogram is input to the encoder. The last encoder \n", "hidden states are input to the decoder via cross-attention mechanisms. The \n", "decoder autoregressively predicts text tokens, jointly conditional on the \n", "encoder hidden states and previously predicted tokens. Figure source: \n", "OpenAI Whisper Blog.
\n", "
" ] }, { "cell_type": "markdown", "id": "21b6316e-8a55-4549-a154-66d3da2ab74a", "metadata": { "id": "21b6316e-8a55-4549-a154-66d3da2ab74a" }, "source": [ "The Whisper checkpoints come in five configurations of varying model sizes.\n", "The smallest four are trained on either English-only or multilingual data.\n", "The largest checkpoint is multilingual only. All nine of the pre-trained checkpoints \n", "are available on the [Hugging Face Hub](https://huggingface.co./models?search=openai/whisper). The \n", "checkpoints are summarised in the following table with links to the models on the Hub:\n", "\n", "| Size | Layers | Width | Heads | Parameters | English-only | Multilingual |\n", "|--------|--------|-------|-------|------------|------------------------------------------------------|---------------------------------------------------|\n", "| tiny | 4 | 384 | 6 | 39 M | [✓](https://huggingface.co./openai/whisper-tiny.en) | [✓](https://huggingface.co./openai/whisper-tiny.) |\n", "| base | 6 | 512 | 8 | 74 M | [✓](https://huggingface.co./openai/whisper-base.en) | [✓](https://huggingface.co./openai/whisper-base) |\n", "| small | 12 | 768 | 12 | 244 M | [✓](https://huggingface.co./openai/whisper-small.en) | [✓](https://huggingface.co./openai/whisper-small) |\n", "| medium | 24 | 1024 | 16 | 769 M | [✓](https://huggingface.co./openai/whisper-medium.en) | [✓](https://huggingface.co./openai/whisper-medium) |\n", "| large | 32 | 1280 | 20 | 1550 M | x | [✓](https://huggingface.co./openai/whisper-large) |\n", "\n", "For demonstration purposes, we'll fine-tune the multilingual version of the \n", "[`\"small\"`](https://huggingface.co./openai/whisper-small) checkpoint with 244M params (~= 1GB). \n", "As for our data, we'll train and evaluate our system on a low-resource language \n", "taken from the [Common Voice](https://huggingface.co./datasets/mozilla-foundation/fleurs_11_0)\n", "dataset. We'll show that with as little as 8 hours of fine-tuning data, we can achieve \n", "strong performance in this language." ] }, { "cell_type": "markdown", "id": "3a680dfc-cbba-4f6c-8a1f-e1a5ff3f123a", "metadata": { "id": "3a680dfc-cbba-4f6c-8a1f-e1a5ff3f123a" }, "source": [ "------------------------------------------------------------------------\n", "\n", "\\\\({}^1\\\\) The name Whisper follows from the acronym “WSPSR”, which stands for “Web-scale Supervised Pre-training for Speech Recognition”." ] }, { "cell_type": "markdown", "id": "b219c9dd-39b6-4a95-b2a1-3f547a1e7bc0", "metadata": { "id": "b219c9dd-39b6-4a95-b2a1-3f547a1e7bc0" }, "source": [ "## Load Dataset\n", "Loading MS-MY Dataset from FLEURS.\n", "Combine train and validation set." ] }, { "cell_type": "code", "execution_count": 1, "id": "a2787582-554f-44ce-9f38-4180a5ed6b44", "metadata": { "id": "a2787582-554f-44ce-9f38-4180a5ed6b44" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6ff7d2f90d6046cfbd8532751c970e97", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading builder script: 0%| | 0.00/12.8k [00:00\n", "\"Trulli\"\n", "
Figure 2:</b> Conversion of sampled audio array to log-Mel spectrogram.\n", "Left: sampled 1-dimensional audio signal. Right: corresponding log-Mel spectrogram. Figure source:\n", "Google SpecAugment Blog.\n", "</figcaption>
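As a concrete illustration of these two operations, here is a minimal sketch (not part of the original notebook) that assumes the `small` checkpoint and its default configuration of 80 Mel bins over 30-second windows:

```python
# Illustrative sketch only: pad/truncate to 30 s, then convert to a log-Mel spectrogram.
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

dummy_audio = np.zeros(5 * 16000, dtype=np.float32)  # 5 s of silence at 16 kHz
features = feature_extractor(dummy_audio, sampling_rate=16000, return_tensors="np")

# the 5 s input is padded to 30 s before the spectrogram is computed
print(features.input_features.shape)  # (1, 80, 3000)
```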
" ] }, { "cell_type": "markdown", "id": "b2ef54d5-b946-4c1d-9fdc-adc5d01b46aa", "metadata": { "id": "b2ef54d5-b946-4c1d-9fdc-adc5d01b46aa" }, "source": [ "We'll load the feature extractor from the pre-trained checkpoint with the default values:" ] }, { "cell_type": "code", "execution_count": 3, "id": "bc77d7bb-f9e2-47f5-b663-30f7a4321ce5", "metadata": { "id": "bc77d7bb-f9e2-47f5-b663-30f7a4321ce5" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3ab6ee91872d461a86bae35c206a8d74", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading: 0%| | 0.00/185k [00:00 1 will enable multiprocessing. If the `.map` method hangs with multiprocessing, set `num_proc=1` and process the dataset sequentially." ] }, { "cell_type": "code", "execution_count": 9, "id": "b459b0c5", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5370b910ba054a4895c487fd81a8fb5b", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/2929 [00:00 Dict[str, torch.Tensor]:\n", " # split inputs and labels since they have to be of different lengths and need different padding methods\n", " # first treat the audio inputs by simply returning torch tensors\n", " input_features = [{\"input_features\": feature[\"input_features\"]} for feature in features]\n", " batch = self.processor.feature_extractor.pad(input_features, return_tensors=\"pt\")\n", "\n", " # get the tokenized label sequences\n", " label_features = [{\"input_ids\": feature[\"labels\"]} for feature in features]\n", " # pad the labels to max length\n", " labels_batch = self.processor.tokenizer.pad(label_features, return_tensors=\"pt\")\n", "\n", " # replace padding with -100 to ignore loss correctly\n", " labels = labels_batch[\"input_ids\"].masked_fill(labels_batch.attention_mask.ne(1), -100)\n", "\n", " # if bos token is appended in previous tokenization step,\n", " # cut bos token here as it's append later anyways\n", " if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():\n", " labels = labels[:, 1:]\n", "\n", " batch[\"labels\"] = labels\n", "\n", " return batch" ] }, { "cell_type": "markdown", "id": "3cae7dbf-8a50-456e-a3a8-7fd005390f86", "metadata": { "id": "3cae7dbf-8a50-456e-a3a8-7fd005390f86" }, "source": [ "Let's initialise the data collator we've just defined:" ] }, { "cell_type": "code", "execution_count": 15, "id": "fc834702-c0d3-4a96-b101-7b87be32bf42", "metadata": { "id": "fc834702-c0d3-4a96-b101-7b87be32bf42" }, "outputs": [], "source": [ "data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)" ] }, { "cell_type": "markdown", "id": "d62bb2ab-750a-45e7-82e9-61d6f4805698", "metadata": { "id": "d62bb2ab-750a-45e7-82e9-61d6f4805698" }, "source": [ "### Evaluation Metrics" ] }, { "cell_type": "markdown", "id": "66fee1a7-a44c-461e-b047-c3917221572e", "metadata": { "id": "66fee1a7-a44c-461e-b047-c3917221572e" }, "source": [ "We'll use the word error rate (WER) metric, the 'de-facto' metric for assessing \n", "ASR systems. For more information, refer to the WER [docs](https://huggingface.co./metrics/wer). 
We'll load the WER metric from 🤗 Evaluate:" ] }, { "cell_type": "code", "execution_count": 16, "id": "b22b4011-f31f-4b57-b684-c52332f92890", "metadata": { "id": "b22b4011-f31f-4b57-b684-c52332f92890" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8e7e70b2e8ba47c6bb0da2ef1a34d3e7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading builder script: 0%| | 0.00/4.49k [00:00\n", " \n", " \n", " [ 553/10000 1:00:12 < 17:12:17, 0.15 it/s, Epoch 1.58/29]\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StepTraining LossValidation Loss

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "trainer.train()" ] }, { "cell_type": "markdown", "id": "810ced54-7187-4a06-b2fe-ba6dcca94dc3", "metadata": { "id": "810ced54-7187-4a06-b2fe-ba6dcca94dc3" }, "source": [ "We can label our checkpoint with the `whisper-event` tag on push by setting the appropriate key-word arguments (kwargs):" ] }, { "cell_type": "code", "execution_count": null, "id": "c704f91e-241b-48c9-b8e0-f0da396a9663", "metadata": { "id": "c704f91e-241b-48c9-b8e0-f0da396a9663" }, "outputs": [], "source": [ "kwargs = {\n", " \"dataset_tags\": [\"google/fleurs\", \"mozilla-foundation/common_voice_11_0\"],\n", " \"dataset\": [\"FLEURS\", \"Common Voice 11.0\"], # a 'pretty' name for the training dataset\n", " \"language\": \"id\",\n", " \"model_name\": \"Whisper Medium ID - FLEURS-CV\", # a 'pretty' name for your model\n", " \"finetuned_from\": \"openai/whisper-medium\",\n", " \"tasks\": \"automatic-speech-recognition\",\n", " \"tags\": \"whisper-event\",\n", "}" ] }, { "cell_type": "markdown", "id": "090d676a-f944-4297-a938-a40eda0b2b68", "metadata": { "id": "090d676a-f944-4297-a938-a40eda0b2b68" }, "source": [ "The training results can now be uploaded to the Hub. To do so, execute the `push_to_hub` command and save the preprocessor object we created:" ] }, { "cell_type": "code", "execution_count": null, "id": "d7030622-caf7-4039-939b-6195cdaa2585", "metadata": { "id": "d7030622-caf7-4039-939b-6195cdaa2585" }, "outputs": [], "source": [ "trainer.push_to_hub(**kwargs)" ] }, { "cell_type": "markdown", "id": "ca743fbd-602c-48d4-ba8d-a2fe60af64ba", "metadata": { "id": "ca743fbd-602c-48d4-ba8d-a2fe60af64ba" }, "source": [ "## Closing Remarks" ] }, { "cell_type": "markdown", "id": "7f737783-2870-4e35-aa11-86a42d7d997a", "metadata": { "id": "7f737783-2870-4e35-aa11-86a42d7d997a" }, "source": [ "In this blog, we covered a step-by-step guide on fine-tuning Whisper for multilingual ASR \n", "using 🤗 Datasets, Transformers and the Hugging Face Hub. For more details on the Whisper model, the Common Voice dataset and the theory behind fine-tuning, refere to the accompanying [blog post](https://huggingface.co./blog/fine-tune-whisper). If you're interested in fine-tuning other \n", "Transformers models, both for English and multilingual ASR, be sure to check out the \n", "examples scripts at [examples/pytorch/speech-recognition](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition)." ] } ], "metadata": { "colab": { "include_colab_link": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }