File size: 5,449 Bytes
00afa45 be5bd12 093a289 be5bd12 093a289 be5bd12 093a289 be5bd12 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 |
---
language: "en"
tags:
- icefall
- k2
- transducer
- librispeech
- ASR
- stateless transducer
- PyTorch
- RNN-T
- speech recognition
license: "apache-2.0"
datasets:
- librispeech
metrics:
- WER
---
# Introduction
This repo contains pre-trained model using
<https://github.com/k2-fsa/icefall/pull/213>.
It is trained on full LibriSpeech dataset.
Also, it uses the `L` subset from [GigaSpeech](https://github.com/SpeechColab/GigaSpeech)
as extra training data.
## How to clone this repo
```
sudo apt-get install git-lfs
git clone https://huggingface.co./csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01
cd icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01
git lfs pull
```
**Catuion**: You have to run `git lfs pull`. Otherwise, you will be SAD later.
The model in this repo is trained using the commit `2332ba312d7ce72f08c7bac1e3312f7e3dd722dc`.
You can use
```
git clone https://github.com/k2-fsa/icefall
cd icefall
git checkout 2332ba312d7ce72f08c7bac1e3312f7e3dd722dc
```
to download `icefall`.
You can find the model information by visiting
<https://github.com/k2-fsa/icefall/blob/2332ba312d7ce72f08c7bac1e3312f7e3dd722dc/egs/librispeech/ASR/transducer_stateless_multi_datasets/train.py#L218>
In short, the encoder is a Conformer model with 8 heads, 12 encoder layers, 512-dim attention, 2048-dim feedforward;
the decoder contains a 1024-dim embedding layer and a Conv1d with kernel size 2.
The decoder architecture is modified from
[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419).
A Conv1d layer is placed right after the input embedding layer.
-----
## Description
This repo provides pre-trained transducer Conformer model for the LibriSpeech dataset
using [icefall][icefall]. There are no RNNs in the decoder. The decoder is stateless
and contains only an embedding layer and a Conv1d.
The commands for training are:
```
cd egs/librispeech/ASR/
./prepare.sh
./prepare_giga_speech.sh
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./transducer_stateless_multi_datasets/train.py \
--world-size 4 \
--num-epochs 40 \
--start-epoch 0 \
--exp-dir transducer_stateless_multi_datasets/exp-full-2 \
--full-libri 1 \
--max-duration 300 \
--lr-factor 5 \
--bpe-model data/lang_bpe_500/bpe.model \
--modified-transducer-prob 0.25 \
--giga-prob 0.2
```
The tensorboard training log can be found at
<https://tensorboard.dev/experiment/xmo5oCgrRVelH9dCeOkYBg/>
The command for decoding is:
```bash
epoch=39
avg=15
sym=1
# greedy search
./transducer_stateless_multi_datasets/decode.py \
--epoch $epoch \
--avg $avg \
--exp-dir transducer_stateless_multi_datasets/exp-full-2 \
--bpe-model ./data/lang_bpe_500/bpe.model \
--max-duration 100 \
--context-size 2 \
--max-sym-per-frame $sym
# modified beam search
./transducer_stateless_multi_datasets/decode.py \
--epoch $epoch \
--avg $avg \
--exp-dir transducer_stateless_multi_datasets/exp-full-2 \
--bpe-model ./data/lang_bpe_500/bpe.model \
--max-duration 100 \
--context-size 2 \
--decoding-method modified_beam_search \
--beam-size 4
```
You can find the decoding log for the above command in this
repo (in the folder `log`).
The WERs for the test datasets are
| | test-clean | test-other | comment |
|-------------------------------------|------------|------------|------------------------------------------|
| greedy search (max sym per frame 1) | 2.64 | 6.55 | --epoch 39, --avg 15, --max-duration 100 |
| modified beam search (beam size 4) | 2.61 | 6.46 | --epoch 39, --avg 15, --max-duration 100 |
# File description
- [log][log], this directory contains the decoding log and decoding results
- [test_wavs][test_wavs], this directory contains wave files for testing the pre-trained model
- [data][data], this directory contains files generated by [prepare.sh][prepare]
- [exp][exp], this directory contains only one file: `preprained.pt`
`exp/pretrained.pt` is generated by the following command:
```bash
./transducer_stateless_multi_datasets/export.py \
--epoch 39 \
--avg 15 \
--bpe-model data/lang_bpe_500/bpe.model \
--exp-dir transducer_stateless_multi_datasets/exp-full-2
```
**HINT**: To use `pretrained.pt` to compute the WER for test-clean and test-other,
just do the following:
```
cp icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01/exp/pretrained.pt \
/path/to/icefall/egs/librispeech/ASR/transducer_stateless_multi_datasets/exp/epoch-999.pt
```
and pass `--epoch 999 --avg 1` to `transducer_stateless_multi_datasets/decode.py`.
[icefall]: https://github.com/k2-fsa/icefall
[prepare]: https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/prepare.sh
[exp]: https://huggingface.co./csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01/tree/main/exp
[data]: https://huggingface.co./csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01/tree/main/data
[test_wavs]: https://huggingface.co./csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01/tree/main/test_wavs
[log]: https://huggingface.co./csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01/tree/main/log
[icefall]: https://github.com/k2-fsa/icefall
|