---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
datasets:
- reazon-research/reazonspeech
tags:
- automatic-speech-recognition
- speech
- audio
- hubert
- gpt_neox
- asr
- nlp
license: apache-2.0
---

# `rinna/nue-asr`

![rinna-icon](./rinna.png)

# Overview
[[Paper]](https://arxiv.org/abs/2312.03668)
[[GitHub]](https://github.com/rinnakk/nue-asr)

We propose a novel end-to-end speech recognition model, `Nue ASR`, which integrates pre-trained speech and language models.

The name `Nue` comes from the Japanese word [`鵺/ぬえ/Nue`](https://en.wikipedia.org/wiki/Nue), one of the Japanese legendary creatures ([`妖怪/ようかい/Yōkai`](https://en.wikipedia.org/wiki/Y%C5%8Dkai)).

This model performs highly accurate Japanese speech recognition.
On a GPU, it can recognize speech faster than real time.

Benchmark scores, including those of our models, are available at https://rinnakk.github.io/research/benchmarks/asr/

* **Model architecture**

    This model consists of three main components: a HuBERT audio encoder, a bridge network, and a GPT-NeoX decoder.
    The weights of the HuBERT and GPT-NeoX components were initialized from the following pre-trained models, respectively.
    - [japanese-hubert-base](https://huggingface.co./rinna/japanese-hubert-base)
    - [japanese-gpt-neox-3.6b](https://huggingface.co./rinna/japanese-gpt-neox-3.6b)

* **Training**

    The model was trained on approximately 19,000 hours of the following Japanese speech corpus.
    - [ReazonSpeech](https://huggingface.co./datasets/reazon-research/reazonspeech)

* **Authors**

    - [Yukiya Hono](https://huggingface.co./yky-h)
    - [Koh Mitsuda](https://huggingface.co./mitsu-koh)
    - [Tianyu Zhao](https://huggingface.co./tianyuz)
    - [Kentaro Mitsui](https://huggingface.co./Kentaro321)
    - [Toshiaki Wakatsuki](https://huggingface.co./t-w)
    - [Kei Sawada](https://huggingface.co./keisawada)

---

# How to use the model

First, install the inference code for this model.

```bash
pip install git+https://github.com/rinnakk/nue-asr.git
```

Both a command-line interface and a Python interface are available.

## Command-line usage
The following command transcribes an audio file via the command-line interface.
Audio files are automatically downsampled to 16kHz.
```bash
nue-asr audio1.wav
```
You can specify multiple audio files.
```bash
nue-asr audio1.wav audio2.flac audio3.mp3
```

DeepSpeed-Inference can be used to speed up inference of the GPT-NeoX module.
To use it, first install DeepSpeed.
```bash
pip install deepspeed
```

Then enable DeepSpeed-Inference as follows:
```bash
nue-asr --use-deepspeed audio1.wav
```

Run `nue-asr --help` for more information.

## Python usage
An example of the Python interface:
```python
import nue_asr

# Load the model and its tokenizer.
model = nue_asr.load_model("rinna/nue-asr")
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

# Transcribe an audio file and print the recognized text.
result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
```
In addition to audio file paths, the `nue_asr.transcribe` function accepts audio data as a `numpy.array` or a `torch.Tensor`.

DeepSpeed-Inference acceleration is also available through the Python interface.
```python
import nue_asr

# Load the model with DeepSpeed-Inference enabled.
model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=True)
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
```

---

# Tokenization
The model uses the same sentencepiece-based tokenizer as [japanese-gpt-neox-3.6b](https://huggingface.co./rinna/japanese-gpt-neox-3.6b).

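
The overview states that recognition on a GPU runs faster than real time. One common way to check this on your own hardware is the real-time factor (RTF): processing time divided by audio duration, where a value below 1 means faster than real time. The sketch below is not part of `nue-asr`; `wav_duration_seconds` and `real_time_factor` are hypothetical helper names, and it assumes an uncompressed WAV input that Python's standard `wave` module can read.

```python
import time
import wave
from typing import Callable


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(transcribe_fn: Callable[[str], object], path: str) -> float:
    """Time one call of transcribe_fn(path) and divide by the audio duration.

    transcribe_fn is any callable that takes a file path (hypothetical
    wrapper around the recognizer). An RTF below 1.0 means the audio was
    processed faster than real time.
    """
    duration = wav_duration_seconds(path)
    if duration <= 0:
        raise ValueError("audio file is empty")
    start = time.perf_counter()
    transcribe_fn(path)
    elapsed = time.perf_counter() - start
    return elapsed / duration
```

With the model and tokenizer loaded as in the Python usage section, a call such as `real_time_factor(lambda p: nue_asr.transcribe(model, tokenizer, p), "path_to_audio.wav")` would time a single transcription. A single measurement can be noisy, so averaging several runs is probably more representative.
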

---

# How to cite
```bibtex
@article{hono2023integration,
    title={An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition},
    author={Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    journal={arXiv preprint arXiv:2312.03668},
    year={2023}
}

@misc{rinna-nue-asr,
    title={rinna/nue-asr},
    author={Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    url={https://huggingface.co./rinna/nue-asr}
}
```

---

# Citations
```bibtex
@article{hsu2021hubert,
    title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
    author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    year={2021},
    volume={29},
    pages={3451-3460},
    doi={10.1109/TASLP.2021.3122291}
}

@software{andoniangpt2021gpt,
    title={{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
    author={Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
    url={https://www.github.com/eleutherai/gpt-neox},
    doi={10.5281/zenodo.5879544},
    month={8},
    year={2021},
    version={0.0.1},
}
```

---

# License
[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)