---
language:
- en
datasets:
- mozilla-foundation/common_voice_13_0
- facebook/voxpopuli
- LIUM/tedlium
- librispeech_asr
- fisher_corpus
- Switchboard-1
- WSJ-0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: tbd
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 2.5
      name: Test WER
    - type: wer
      value: 5.6
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 6.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 12.1
      name: Test WER
---

# EBranchRegulaFormer

This is a **174M-parameter encoder-decoder E-Branchformer model** trained with an intermediate regularization technique on 6,000 hours of open-source English data.

It achieves word error rates (WERs) comparable to `openai/whisper-medium.en` across multiple datasets with about a quarter of the parameters.

Architecture details, training hyperparameters, and a description of the proposed technique will be added soon.

*Disclaimer: The model currently hallucinates on segments that contain only silence, because it was not trained on such data.
The fix will be added soon.*

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio files of arbitrary length.

```python
from transformers import pipeline

model_id = "BUT-FIT/EBranchRegulaFormer-medium"
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    feature_extractor=model_id,
    trust_remote_code=True,
)
# In newer versions of transformers (>4.31.0), there is a bug in the pipeline inference type.
# The warning can be ignored.
pipe.type = "seq2seq"

# Standard greedy decoding
result = pipe("audio.wav")

# Beam search decoding with joint CTC-autoregressive scorer.
# generation_config is the model's own config object, so it is modified in place.
generation_config = pipe.model.generation_config
generation_config.ctc_weight = 0.3
generation_config.num_beams = 5
generation_config.ctc_margin = 0
result = pipe("audio.wav")
```
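
To sanity-check transcripts against reference text, the WER metric reported in the metadata can be sketched as word-level edit distance divided by the number of reference words. This is a minimal illustration, not the evaluation script used for this model: the function name `word_error_rate` is hypothetical, and text normalization (casing, punctuation), which evaluation setups commonly apply before scoring, is assumed to have been done beforehand.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word missing
                    curr[j - 1] + 1,     # insertion: extra hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical strings: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.1f}")  # 33.3
```

With the pipeline above, the hypothesis would typically be `result["text"]`. WER can exceed 100% when the hypothesis contains many inserted words, which is how hallucinations on silent segments tend to show up in this metric.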