---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
datasets:
- reazon-research/reazonspeech
tags:
- automatic-speech-recognition
- speech
- audio
- hubert
- gpt_neox
- asr
- nlp
license: apache-2.0
---

# `rinna/nue-asr`

![rinna-icon](./rinna.png)

# Overview

[[Paper]](https://arxiv.org/abs/2312.03668)
[[GitHub]](https://github.com/rinnakk/nue-asr)

We propose a novel end-to-end speech recognition model, `Nue ASR`, which integrates pre-trained speech and language models.

The name `Nue` comes from the Japanese word ([`鵺/ぬえ/Nue`](https://en.wikipedia.org/wiki/Nue)), one of the Japanese legendary creatures ([`妖怪/ようかい/Yōkai`](https://en.wikipedia.org/wiki/Y%C5%8Dkai)).

This model performs highly accurate Japanese speech recognition.
With a GPU, it can recognize speech faster than real-time.

Benchmark scores, including those of our models, are available at https://rinnakk.github.io/research/benchmarks/asr/

* **Model architecture**

    This model consists of three main components: a HuBERT audio encoder, a bridge network, and a GPT-NeoX decoder.
    The HuBERT and GPT-NeoX components were initialized with the following pre-trained weights, respectively.
    - [japanese-hubert-base](https://huggingface.co/rinna/japanese-hubert-base)
    - [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b)
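As a rough conceptual sketch of what a bridge network between an audio encoder and a text decoder does (this is not the released design or weights; the frame-stacking factor of 2 and the hidden sizes of 768 for HuBERT base and 2816 for the 3.6B GPT-NeoX are illustrative assumptions), encoder frames can be stacked and linearly projected into the decoder's embedding dimension:

```python
# Conceptual sketch only: map HuBERT-style encoder frames into a
# GPT-NeoX-style embedding space by stacking consecutive frames and
# applying a single linear projection. All sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
enc_dim, dec_dim, stack = 768, 2816, 2  # illustrative dimensions

W = rng.standard_normal((enc_dim * stack, dec_dim)) * 0.02  # projection weights

def bridge(encoder_frames: np.ndarray) -> np.ndarray:
    """Stack every `stack` consecutive frames, then project to dec_dim."""
    T = (encoder_frames.shape[0] // stack) * stack  # drop any trailing frame
    x = encoder_frames[:T].reshape(T // stack, enc_dim * stack)
    return x @ W

feats = rng.standard_normal((101, enc_dim))  # ~2 s of encoder frames
emb = bridge(feats)
print(emb.shape)  # (50, 2816)
```

Stacking frames also halves the sequence length the decoder must attend over, which is one common motivation for this kind of connector.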

* **Training**

    The model was trained on approximately 19,000 hours of the following Japanese speech corpus.
    - [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)

* **Authors**

    - [Yukiya Hono](https://huggingface.co/yky-h)
    - [Koh Mitsuda](https://huggingface.co/mitsu-koh)
    - [Tianyu Zhao](https://huggingface.co/tianyuz)
    - [Kentaro Mitsui](https://huggingface.co/Kentaro321)
    - [Toshiaki Wakatsuki](https://huggingface.co/t-w)
    - [Kei Sawada](https://huggingface.co/keisawada)

---

# How to use the model

First, install the package for running inference with this model.

```bash
pip install git+https://github.com/rinnakk/nue-asr.git
```

Both a command-line interface and a Python interface are available.

## Command-line usage

The following command transcribes an audio file via the command-line interface.
Audio files are automatically downsampled to 16 kHz.

```bash
nue-asr audio1.wav
```

You can also specify multiple audio files.

```bash
nue-asr audio1.wav audio2.flac audio3.mp3
```
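As an illustrative aside, the automatic downsampling to 16 kHz mentioned above amounts to resampling the waveform. The function below is a hypothetical sketch using simple linear interpolation, not the CLI's actual implementation; real pipelines typically use a dedicated resampler (e.g. from librosa or torchaudio).

```python
# Illustrative sketch only: resample a 1-D waveform to 16 kHz by
# linear interpolation. The nue-asr CLI handles this internally.
import numpy as np

def resample_to_16k(waveform: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a 1-D waveform from orig_sr to 16000 Hz."""
    target_sr = 16000
    if orig_sr == target_sr:
        return waveform
    n_out = int(round(len(waveform) * target_sr / orig_sr))
    old_t = np.arange(len(waveform)) / orig_sr  # original sample times (s)
    new_t = np.arange(n_out) / target_sr        # target sample times (s)
    return np.interp(new_t, old_t, waveform).astype(waveform.dtype)
```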

DeepSpeed-Inference can be used to accelerate inference in the GPT-NeoX module.
To use DeepSpeed-Inference, first install DeepSpeed.

```bash
pip install deepspeed
```

Then, enable DeepSpeed-Inference as follows:

```bash
nue-asr --use-deepspeed audio1.wav
```

Run `nue-asr --help` for more information.

## Python usage

A minimal example of the Python interface is as follows:

```python
import nue_asr

model = nue_asr.load_model("rinna/nue-asr")
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
```

The `nue_asr.transcribe` function accepts audio data as either a `numpy.ndarray` or a `torch.Tensor`, in addition to audio file paths.
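For instance, a waveform can be loaded into a `numpy` array with the standard library's `wave` module before being passed to `nue_asr.transcribe`. The helper below is a hypothetical sketch for 16-bit PCM mono WAV files, not part of the `nue_asr` API:

```python
# Hypothetical helper (not part of nue_asr): read a 16-bit PCM WAV file
# into a float32 numpy array in [-1, 1], which could then be passed to
# nue_asr.transcribe in place of a file path, e.g.:
#   result = nue_asr.transcribe(model, tokenizer, load_wav_as_array("audio1.wav"))
import wave

import numpy as np

def load_wav_as_array(path: str) -> np.ndarray:
    """Read a 16-bit PCM WAV file and return a float32 waveform."""
    with wave.open(path, "rb") as f:
        assert f.getsampwidth() == 2, "expects 16-bit PCM"
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    return pcm.astype(np.float32) / 32768.0
```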

DeepSpeed-Inference can also be used to accelerate inference through the Python interface.

```python
import nue_asr

model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=True)
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
```

---

# Tokenization

The model uses the same SentencePiece-based tokenizer as [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b).

---

# How to cite

```bibtex
@article{hono2023integration,
    title={An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition},
    author={Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    journal={arXiv preprint arXiv:2312.03668},
    year={2023}
}

@misc{rinna-nue-asr,
    title={rinna/nue-asr},
    author={Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    url={https://huggingface.co/rinna/nue-asr}
}
```

---

# Citations

```bibtex
@article{hsu2021hubert,
    title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
    author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    year={2021},
    volume={29},
    pages={3451-3460},
    doi={10.1109/TASLP.2021.3122291}
}

@software{andoniangpt2021gpt,
    title={{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
    author={Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
    url={https://www.github.com/eleutherai/gpt-neox},
    doi={10.5281/zenodo.5879544},
    month={8},
    year={2021},
    version={0.0.1}
}
```

---

# License

[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)