---
language:
- pt
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
base_model: openai/whisper-medium
model-index:
- name: Whisper Medium Portuguese
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 pt
      type: mozilla-foundation/common_voice_11_0
      config: pt
      split: test
      args: pt
    metrics:
    - type: wer
      value: 6.5785713084850626
      name: Wer
---

# Whisper Medium Portuguese 🇧🇷🇵🇹

Welcome to Whisper Medium for Portuguese transcription 👋🏻

If you are looking to transcribe Portuguese audio to text **quickly** and **reliably**, you are in the right place! With a state-of-the-art [Word Error Rate](https://huggingface.co./spaces/evaluate-metric/wer) (WER) of just **6.579** on Common Voice 11, this model makes roughly **2x** fewer errors than prior state-of-the-art [wav2vec2](https://huggingface.co./Edresson/wav2vec2-large-xlsr-coraa-portuguese) models. Compared to the original [whisper-medium](https://huggingface.co./openai/whisper-medium), it delivers a **1.2x** improvement 🚀.

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co./openai/whisper-medium) on the [mozilla-foundation/common_voice_11_0](https://huggingface.co./datasets/mozilla-foundation/common_voice_11_0) dataset.

The following table compares our model's results with those of the most downloaded models on the Hub for [Portuguese Automatic Speech Recognition](https://huggingface.co./models?language=pt&pipeline_tag=automatic-speech-recognition&sort=downloads) 🗣:

| Model | WER | Parameters |
|--------------------------------------------------|:--------:|:------------:|
| [openai/whisper-medium](https://huggingface.co./openai/whisper-medium) | 8.100 | 769M |
| [jlondonobo/whisper-medium-pt](https://huggingface.co./jlondonobo/whisper-medium-pt) | **6.579** 🤗 | 769M |
| [jonatasgrosman/wav2vec2-large-xlsr-53-portuguese](https://huggingface.co./jonatasgrosman/wav2vec2-large-xlsr-53-portuguese) | 11.310 | 317M |
| [Edresson/wav2vec2-large-xlsr-coraa-portuguese](https://huggingface.co./Edresson/wav2vec2-large-xlsr-coraa-portuguese) | 20.080 | 317M |

### How to use

You can use this model directly with a pipeline. This is especially useful for short audio. For **long-form** transcription, please use the code in the [Long-form transcription](#long-form-transcription) section.

```bash
pip install git+https://github.com/huggingface/transformers --force-reinstall
pip install torch
```

```python
>>> from transformers import pipeline
>>> import torch

>>> device = 0 if torch.cuda.is_available() else "cpu"

# Load the pipeline
>>> transcribe = pipeline(
...     task="automatic-speech-recognition",
...     model="jlondonobo/whisper-medium-pt",
...     chunk_length_s=30,
...     device=device,
... )

# Force the model to transcribe in Portuguese
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="pt", task="transcribe")

# Transcribe your audio file
>>> transcribe("audio.m4a")["text"]
'Eu falo português.'
```
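If you prefer to work with the model and processor directly instead of the pipeline, the minimal sketch below shows one way to do it. It assumes a 16 kHz mono WAV file (the path `audio.wav` is hypothetical) and that `soundfile` is installed; it is an illustration of the standard `transformers` Whisper API, not the exact code used for this card.

```python
>>> import soundfile as sf
>>> from transformers import WhisperForConditionalGeneration, WhisperProcessor

>>> processor = WhisperProcessor.from_pretrained("jlondonobo/whisper-medium-pt")
>>> model = WhisperForConditionalGeneration.from_pretrained("jlondonobo/whisper-medium-pt")

# Load a 16 kHz mono waveform ("audio.wav" is a hypothetical path)
>>> speech, sampling_rate = sf.read("audio.wav")

# Convert the waveform into log-Mel input features
>>> inputs = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")

# Same Portuguese transcription forcing as in the pipeline example above
>>> forced_ids = processor.get_decoder_prompt_ids(language="pt", task="transcribe")

# Generate token ids and decode them to text
>>> predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
>>> processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```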
#### Long-form transcription

To improve performance on long-form transcription, you can convert the HF model into a `whisper` model and use the original paper's matching algorithm. To do this, you must install `whisper` and a set of tools developed by [@bayartsogt](https://huggingface.co./bayartsogt).

```bash
pip install git+https://github.com/openai/whisper.git
pip install git+https://github.com/bayartsogt-ya/whisper-multiple-hf-datasets
```

Then convert the Hugging Face model and transcribe:

```python
>>> import torch
>>> import whisper
>>> from multiple_datasets.hub_default_utils import convert_hf_whisper

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

# Convert the HF model into a local whisper checkpoint
>>> convert_hf_whisper("jlondonobo/whisper-medium-pt", "local_whisper_model.pt")

# Load the whisper model
>>> model = whisper.load_model("local_whisper_model.pt", device=device)

# Transcribe arbitrarily long audio
>>> model.transcribe("long_audio.m4a", language="pt")["text"]
'Olá eu sou o José. Tenho 23 anos e trabalho...'
```

### Training hyperparameters

We used the following hyperparameters for training (a sketch mapping them to `Seq2SeqTrainingArguments` appears at the end of this card):

- `learning_rate`: 1e-05
- `train_batch_size`: 32
- `eval_batch_size`: 16
- `seed`: 42
- `optimizer`: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 500
- `training_steps`: 5000
- `mixed_precision_training`: Native AMP

### Training results

| Training Loss | Epoch | Step | Validation Loss | WER |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.0698 | 1.09 | 1000 | 0.1876 | 7.189 |
| 0.0218 | 3.07 | 2000 | 0.2254 | 7.110 |
| 0.0053 | 5.06 | 3000 | 0.2711 | 6.969 |
| 0.0017 | 7.04 | 4000 | 0.3030 | 6.686 |
| 0.0005 | 9.02 | 5000 | 0.3205 | **6.579** 🤗 |

### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2
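### Training configuration sketch

For readers who want to reproduce a similar setup, here is a hedged sketch of how the hyperparameters listed above might map onto `Seq2SeqTrainingArguments`. The `output_dir`, evaluation strategy, evaluation interval, and `predict_with_generate` flag are assumptions rather than settings confirmed by this card; the Adam betas and epsilon listed above match the Trainer defaults.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: maps the "Training hyperparameters" section onto the HF Trainer API.
# Values marked as assumptions were not reported on this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-pt",   # assumption
    learning_rate=1e-5,
    per_device_train_batch_size=32,     # train_batch_size (assumed per device)
    per_device_eval_batch_size=16,      # eval_batch_size
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,                   # lr_scheduler_warmup_steps
    max_steps=5000,                     # training_steps
    fp16=True,                          # mixed_precision_training: Native AMP
    evaluation_strategy="steps",        # assumption
    eval_steps=1000,                    # inferred from the cadence in the results table
    predict_with_generate=True,         # typical for Whisper fine-tuning (assumption)
)
# Adam betas=(0.9, 0.999) and epsilon=1e-08 are the Trainer defaults,
# so no explicit optimizer arguments are needed here.
```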