Getting Started
Details on the model, its performance, and more are available on arXiv.
Clone the model
The Reverb ASR model v1 is stored in this model repository.
Install inference requirements
See our inference code at https://github.com/revdotcom/reverb/tree/main/asr
About
Rev’s Reverb ASR was trained on 200,000 hours of English speech, all expertly transcribed by humans, making it the largest corpus of human-transcribed audio ever used to train an open-source model. The quality of this data has produced the world’s most accurate English automatic speech recognition (ASR) system, using an efficient model architecture that can be run on either CPU or GPU. Additionally, Reverb ASR gives users control over the level of verbatimicity of the output transcript, making it suitable both for clean, readable transcription and for use cases like audio editing that require transcription of every spoken word, including hesitations and re-wordings. Users can specify fully verbatim, fully non-verbatim, or anywhere in between for their transcription output.
Code
The folder wenet is a fork of the WeNet repository, with some modifications made for the Rev-specific architecture.
The folder wer_evaluation contains instructions and code for running different benchmark utilities. These scripts are not specific to the Reverb architecture.
Features
Transcription Style Options
Reverb ASR was trained to produce transcriptions in either a verbatim style, in which every word is transcribed as spoken, or a non-verbatim style, in which disfluencies may be removed from the transcript.
Users can specify Reverb ASR's output style with the verbatimicity parameter: 1 corresponds to a verbatim transcript and 0 corresponds to a non-verbatim transcript. Values between 0 and 1 are accepted and may correspond to a semi-non-verbatim style. See our demo here to test the verbatimicity parameter with your own audio.
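To compare output styles across the 0 to 1 range, it can help to try a handful of evenly spaced verbatimicity values on the same audio. The sketch below only generates those values; the helper name verbatimicity_sweep is hypothetical and not part of the Reverb code.

```python
def verbatimicity_sweep(steps=5):
    """Return `steps` evenly spaced verbatimicity values from 0.0 to 1.0.

    Hypothetical helper for illustration; 0.0 is fully non-verbatim and
    1.0 is fully verbatim, as described above.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2 to cover both endpoints")
    return [i / (steps - 1) for i in range(steps)]


if __name__ == "__main__":
    for value in verbatimicity_sweep(5):
        print(f"--verbatimicity {value}")
```

Each printed value can be passed to the verbatimicity parameter in turn to see how the transcript changes.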
Decoding Options
Reverb ASR uses the joint CTC/attention architecture described here and here, and supports multiple modes of decoding. Users can pass one or more decoding modes to recognize_wav.py; a separate output directory is created for each decoding mode.
Decoding options are:
attention
ctc_greedy_search
ctc_prefix_beam_search
attention_rescoring
joint_decoding
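When scripting around recognize_wav.py, it may be useful to catch a misspelled mode name before launching a long decoding run. The following is a minimal sketch that checks requested modes against the list above; validate_modes and SUPPORTED_MODES are hypothetical names, and the script itself may perform its own validation.

```python
# Decoding modes listed in this README.
SUPPORTED_MODES = (
    "attention",
    "ctc_greedy_search",
    "ctc_prefix_beam_search",
    "attention_rescoring",
    "joint_decoding",
)


def validate_modes(modes):
    """Return the requested modes without duplicates, preserving order.

    Raises ValueError if any name is not one of SUPPORTED_MODES.
    """
    unknown = [m for m in modes if m not in SUPPORTED_MODES]
    if unknown:
        raise ValueError(f"unsupported decoding mode(s): {', '.join(unknown)}")
    # dict.fromkeys keeps the first occurrence of each mode, in order.
    return list(dict.fromkeys(modes))


if __name__ == "__main__":
    print(validate_modes(["ctc_prefix_beam_search", "attention_rescoring"]))
```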
Usage
python wenet/bin/recognize_wav.py --config model.yaml \
--checkpoint model.pt \
--audio hello_world.wav \
--modes ctc_prefix_beam_search attention_rescoring \
--gpu 0 \
--verbatimicity 1.0
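The same invocation can be assembled from Python, for example to loop over several audio files. This is a sketch that uses only the flags shown in the command above; build_recognize_command is a hypothetical helper, and the default file names (model.yaml, model.pt) are simply copied from that example.

```python
import shlex


def build_recognize_command(audio, modes, verbatimicity=1.0,
                            config="model.yaml", checkpoint="model.pt", gpu=0):
    """Build the argument list for wenet/bin/recognize_wav.py.

    Hypothetical helper: it mirrors the flags from the usage example and
    does not run anything itself.
    """
    if not modes:
        raise ValueError("at least one decoding mode is required")
    if not 0.0 <= verbatimicity <= 1.0:
        raise ValueError("verbatimicity must be between 0 and 1")
    return [
        "python", "wenet/bin/recognize_wav.py",
        "--config", config,
        "--checkpoint", checkpoint,
        "--audio", audio,
        "--modes", *modes,
        "--gpu", str(gpu),
        "--verbatimicity", str(verbatimicity),
    ]


if __name__ == "__main__":
    cmd = build_recognize_command(
        "hello_world.wav",
        ["ctc_prefix_beam_search", "attention_rescoring"],
    )
    # Print a copy-pasteable shell command. To execute it from the
    # repository's asr directory, pass `cmd` to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when audio paths contain spaces.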
Or check out our demo on Hugging Face.
Benchmarking
See the wer_evaluation folder of https://github.com/revdotcom/reverb/tree/main/asr for details and results.
Cite this Model
If you use this model, please use the following citation:
@misc{bhandari2024reverbopensourceasrdiarization,
title={Reverb: Open-Source ASR and Diarization from Rev},
author={Nishchal Bhandari and Danny Chen and Miguel Ángel del Río Fernández and Natalie Delworth and Jennifer Drexler Fox and Migüel Jetté and Quinten McNamara and Corey Miller and Ondřej Novotný and Ján Profant and Nan Qin and Martin Ratajczak and Jean-Philippe Robichaud},
year={2024},
eprint={2410.03930},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.03930},
}
Acknowledgments
Special thanks to the WeNet team for their work and for making it available under an open-source license.
License
See LICENSE for details.