igitman committed on
Commit 41d1523 (1 parent: ab74426)

Update model and readme to remove space before punctuation
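The commit message says the earlier checkpoint emitted a space before punctuation marks. The diff does not show how the model itself was changed. As an illustrative assumption only, transcripts already produced by the earlier checkpoint could be cleaned up with a regular expression like the one below. The example sentence and the set of punctuation marks are hypothetical and may need adjusting for your data.

```python
import re

# Assumed set of marks that should attach directly to the preceding word.
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([.,;:!?])")


def remove_space_before_punctuation(text: str) -> str:
    """Drop whitespace that precedes a punctuation mark, e.g. 'stai ?' -> 'stai?'."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)


print(remove_space_before_punctuation("Ciao , come stai ?"))  # prints "Ciao, come stai?"
```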

README.md CHANGED
@@ -35,7 +35,7 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 5.77
+ value: 5.67
  - task:
  type: Automatic Speech Recognition
  name: automatic-speech-recognition
@@ -49,7 +49,7 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 11.47
+ value: 11.11
  - task:
  type: Automatic Speech Recognition
  name: speech-recognition
@@ -63,7 +63,7 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 15.60
+ value: 16.16
  - task:
  type: Automatic Speech Recognition
  name: speech-recognition
@@ -77,7 +77,7 @@ model-index:
  metrics:
  - name: Test WER P&C
  type: wer
- value: 8.17
+ value: 8.14
  - task:
  type: Automatic Speech Recognition
  name: automatic-speech-recognition
@@ -91,7 +91,7 @@ model-index:
  metrics:
  - name: Test WER P&C
  type: wer
- value: 22.48
+ value: 22.06
  - task:
  type: Automatic Speech Recognition
  name: speech-recognition
@@ -105,7 +105,7 @@ model-index:
  metrics:
  - name: Test WER P&C
  type: wer
- value: 19.55
+ value: 19.96
  ---
  # NVIDIA FastConformer-Hybrid Large (it)

@@ -129,7 +129,7 @@ See the [model architecture](#model-architecture) section and [NeMo documentatio
  To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest Pytorch version.
  ```
  pip install nemo_toolkit['all']
- ```
+ ```

  ## How to Use this Model

@@ -156,15 +156,15 @@ asr_model.transcribe(['2086-149220-0033.wav'])

  Using Transducer mode inference:
  ```shell
- python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
- pretrained_name="nvidia/stt_it_fastconformer_hybrid_large_pc"
+ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
+ pretrained_name="nvidia/stt_it_fastconformer_hybrid_large_pc"
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
  ```

  Using CTC mode inference:
  ```shell
- python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
- pretrained_name="nvidia/stt_it_fastconformer_hybrid_large_pc"
+ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
+ pretrained_name="nvidia/stt_it_fastconformer_hybrid_large_pc"
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
  decoder_type="ctc"
  ```
@@ -199,21 +199,21 @@ The model in this collection are trained on a composite dataset (NeMo PnC IT ASR

  The performance of Automatic Speech Recognition models is measuring using Word Error Rate. Since this dataset is trained on multiple domains and a much larger corpus, it will generally perform better at transcribing audio in general.

- The following tables summarizes the performance of the available models in this collection with the Transducer decoder. Performances of the ASR models are reported in terms of Word Error Rate (WER%) with greedy decoding.
+ The following tables summarizes the performance of the available models in this collection with the Transducer decoder. Performances of the ASR models are reported in terms of Word Error Rate (WER%) with greedy decoding.


  a) On data without Punctuation and Capitalization

  | Version | Tokenizer | Vocabulary Size | MCV 12.0 Dev | MCV 12.0 Test | MLS Dev | MLS Test | VoxPopuli Dev | VoxPopuli Test |
  |---------|-----------------------|-----------------|--------------|---------------|---------|----------|---------------|----------------|
- | 1.18.0 | SentencePiece Unigram | 1024 | 5.14% | 5.68% | 13.83% | 11.71% | 12.80% | 15.72% |
+ | 1.20.0 | SentencePiece BPE | 512 | 5.13% | 5.67% | 13.16% | 11.11% | 12.92% | 16.16% |


  b) On data with Punctuation and Capitalization

  | Version | Tokenizer | Vocabulary Size | MCV 12.0 Dev | MCV 12.0 Test | MLS Dev | MLS Test | VoxPopuli Dev | VoxPopuli Test |
  |---------|-----------------------|-----------------|--------------|---------------|---------|----------|---------------|----------------|
- | 1.18.0 | SentencePiece Unigram | 1024 | 7.75% | 8.17% | 26.37% | 22.48% | 16.78% | 19.55% |
+ | 1.20.0 | SentencePiece BPE | 512 | 7.66% | 8.14% | 26.48% | 22.06% | 16.91% | 19.96% |


  ## Limitations
@@ -221,15 +221,15 @@ Since this model was trained on publically available speech datasets, the perfor

  ## NVIDIA Riva: Deployment

- [NVIDIA Riva](https://developer.nvidia.com/riva), is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded.
- Additionally, Riva provides:
+ [NVIDIA Riva](https://developer.nvidia.com/riva), is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded.
+ Additionally, Riva provides:

- * World-class out-of-the-box accuracy for the most common languages with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
- * Best in class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
- * Streaming speech recognition, Kubernetes compatible scaling, and enterprise-grade support
+ * World-class out-of-the-box accuracy for the most common languages with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
+ * Best in class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
+ * Streaming speech recognition, Kubernetes compatible scaling, and enterprise-grade support

- Although this model isn’t supported yet by Riva, the [list of supported models is here](https://huggingface.co/models?other=Riva).
- Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
+ Although this model isn’t supported yet by Riva, the [list of supported models is here](https://huggingface.co/models?other=Riva).
+ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).

  ## References
  [1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
@@ -240,4 +240,4 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).

  ## Licence

- License to use this model is covered by the [CC-BY-4 License](https://creativecommons.org/licenses/by/4.0/legalcode) unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4 License](https://creativecommons.org/licenses/by/4.0/legalcode).
+ License to use this model is covered by the [CC-BY-4 License](https://creativecommons.org/licenses/by/4.0/legalcode) unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4 License](https://creativecommons.org/licenses/by/4.0/legalcode).
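The README measures performance with Word Error Rate and reports it twice: on text without punctuation and capitalization ("Test WER", table a) and on text with it ("Test WER P&C", table b). The sketch below shows how such numbers can be computed. It assumes the standard definition, (substitutions + deletions + insertions) divided by the number of reference words, pooled over the whole test set, and a hypothetical `strip_pnc` normalization for the "without P&C" case. NVIDIA's exact text normalization is not stated in this diff, so results from this sketch may not match the published figures.

```python
import re


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the first i-1 reference words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of ref[i-1]
                curr[j - 1] + 1,     # insertion of hyp[j-1]
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            )
        prev = curr
    return prev[-1], len(ref)


def wer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus WER in percent: total word edits divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


def strip_pnc(text: str) -> str:
    """Hypothetical 'without P&C' normalization: lowercase, drop punctuation
    (apostrophes are kept), and collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


refs = ["Ciao, come stai?"]
hyps = ["ciao come stai"]
print(wer(refs, hyps))  # 66.666...: "Ciao,"/"ciao" and "stai?"/"stai" are substitutions
print(wer([strip_pnc(r) for r in refs], [strip_pnc(h) for h in hyps]))  # 0.0
```

The same hypothesis scores very differently under the two conventions, which is why the P&C figures in table b) are consistently higher than those in table a).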
stt_it_fastconformer_hybrid_large_pc.nemo CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:54b843c8c5da2301a689a7e586ea1ff270655aa6d4e0ed1174e1ff041a7c6c56
- size 459223040
+ oid sha256:1bf97c6b148d20c10dea8f950da3359244c7fb9681994153e41cda8c276e77ea
+ size 455505920
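Both versions of `stt_it_fastconformer_hybrid_large_pc.nemo` in this diff are Git LFS pointer files: the repository tracks only the object's SHA-256 (`oid`) and its size in bytes, while the checkpoint itself lives in LFS storage. The sketch below checks a local copy of the checkpoint against the new pointer. It assumes you have already downloaded the file; the helper names are hypothetical and not part of any Hugging Face or Git LFS tool.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer_text: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the size and SHA-256 digest in the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algorithm, _, expected_hex = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algorithm}")
    # Cheap size check first; hashing a ~455 MB checkpoint takes a while.
    if path.stat().st_size != int(fields["size"]):
        return False
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex


# Pointer contents after this commit, copied from the diff.
NEW_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1bf97c6b148d20c10dea8f950da3359244c7fb9681994153e41cda8c276e77ea
size 455505920
"""

# Example (assumes the checkpoint was downloaded to the working directory):
# matches_pointer(Path("stt_it_fastconformer_hybrid_large_pc.nemo"), NEW_POINTER)
```

A file that still matches the old pointer (459,223,040 bytes) is the checkpoint from before this commit.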