imedennikov committed on
Commit
3379f44
·
verified ·
1 Parent(s): f059506

Update README.md

  1. README.md +40 -19
README.md CHANGED
@@ -144,43 +144,46 @@ The model is available for use in the NeMo Framework[5], and can be used as a pr
144
  from nemo.collections.asr.models import SortformerEncLabelModel
145
 
146
  # load model
147
- diar_model = SortformerEncLabelModel.restore_from(restore_path="diar_sortformer_4spk-v1", map_location=torch.device('cuda'), strict=False)
148
  ```
149
 
150
  ### Input Format
151
- Input to Sortformer can be either a list of paths to audio files or a jsonl manifest file.
152
-
153
  ```python
154
- pred_outputs = diar_model.diarize(audio=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"], batch_size=1)
155
  ```
156
-
157
- Individual audio file can be fed into Sortformer model as follows:
158
  ```python
159
- pred_output1 = diar_model.diarize(audio="/path/to/multispeaker_audio1.wav", batch_size=1)
160
  ```
161
-
162
-
163
- To use Sortformer for performing diarization on a multi-speaker audio recording, specify the input as jsonl manifest file, where each line in the file is a dictionary containing the following fields:
164
-
 
165
  ```yaml
166
  # Example of a line in `multispeaker_manifest.json`
167
  {
168
  "audio_filepath": "/path/to/multispeaker_audio1.wav", # path to the input audio file
169
- "offset": 0 # offset (start) time of the input audio
170
  "duration": 600, # duration of the audio, can be set to `null` if using NeMo main branch
171
  }
172
  {
173
  "audio_filepath": "/path/to/multispeaker_audio2.wav",
174
- "offset": 0,
175
  "duration": 580,
176
  }
177
  ```
178
 
179
- and then use:
 
180
  ```python
181
- pred_outputs = diar_model.diarize(audio="/path/to/multispeaker_manifest.json", batch_size=1)
 
 
 
 
182
  ```
183
-
184
 
185
  ### Input
186
 
@@ -190,7 +193,7 @@ This model accepts single-channel (mono) audio sampled at 16,000 Hz.
190
 
191
  ### Output
192
 
193
- The output of the model is an T x S matrix, where:
194
  - S is the maximum number of speakers (in this model, S = 4).
195
  - T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.
196
  Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.
@@ -202,9 +205,27 @@ Each element of the T x S matrix represents the speaker activity probability in
202
  Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90-second training samples and a batch size of 4.
203
  The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).
204
 
205
- ### Inference
206
 
207
- Sortformer diarizer models can be performed with post-processing algorithms using inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py). If you provide the post-processing YAML configs in [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
208
 
209
  ### Technical Limitations
210
 
 
144
  from nemo.collections.asr.models import SortformerEncLabelModel
145
 
146
  # load model
147
+ diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_sortformer_4spk-v1.nemo", map_location=torch.device('cuda'), strict=False)
148
  ```
149
 
150
  ### Input Format
151
+ Input to Sortformer can be an individual audio file:
 
152
  ```python
153
+ audio_input="/path/to/multispeaker_audio1.wav"
154
  ```
155
+ or a list of paths to audio files:
 
156
  ```python
157
+ audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
158
  ```
159
+ or a jsonl manifest file:
160
+ ```python
161
+ audio_input="/path/to/multispeaker_manifest.json"
162
+ ```
163
+ where each line is a dictionary containing the following fields:
164
  ```yaml
165
  # Example of a line in `multispeaker_manifest.json`
166
  {
167
  "audio_filepath": "/path/to/multispeaker_audio1.wav", # path to the input audio file
168
+ "offset": 0, # offset (start) time of the input audio
169
  "duration": 600, # duration of the audio, can be set to `null` if using NeMo main branch
170
  }
171
  {
172
  "audio_filepath": "/path/to/multispeaker_audio2.wav",
173
+ "offset": 900,
174
  "duration": 580,
175
  }
176
  ```
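A manifest in this format could be generated with a few lines of Python. This is only a minimal sketch: `write_manifest` is a hypothetical helper (not part of NeMo), and the paths, offsets, and durations are placeholders copied from the example above.

```python
import json
import os
import tempfile

def write_manifest(entries, manifest_path):
    """Write one JSON object per line (jsonl), using the fields listed above."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, offset, duration in entries:
            record = {
                "audio_filepath": audio_filepath,  # path to the input audio file
                "offset": offset,                  # start time within the audio, in seconds
                "duration": duration,              # duration to process, in seconds
            }
            f.write(json.dumps(record) + "\n")

# Placeholder entries; replace with real paths and timings.
entries = [
    ("/path/to/multispeaker_audio1.wav", 0, 600),
    ("/path/to/multispeaker_audio2.wav", 900, 580),
]
manifest_path = os.path.join(tempfile.mkdtemp(), "multispeaker_manifest.json")
write_manifest(entries, manifest_path)
```

The resulting path can then be used as `audio_input`.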
177
 
178
+ ### Getting Diarization Results
179
+ To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
180
  ```python
181
+ predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
182
+ ```
183
+ To also obtain tensors of speaker activity probabilities, use:
184
+ ```python
185
+ predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
186
  ```
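The returned segments can be turned into typed values for further processing. The sketch below assumes each segment is a string whose three fields are separated by whitespace and/or commas; the exact separator and the speaker label format are assumptions, and the example strings are made up rather than real model output. `parse_segment` is a hypothetical helper.

```python
import re

def parse_segment(segment):
    """Split 'begin end speaker' (comma- or space-separated) into typed fields."""
    parts = [p for p in re.split(r"[,\s]+", segment.strip()) if p]
    begin, end, speaker = parts
    return float(begin), float(end), speaker

# Made-up segments standing in for the result of one audio file.
example_segments = ["0.32 4.16 speaker_0", "4.00, 7.84, speaker_1"]
parsed = [parse_segment(s) for s in example_segments]

# Accumulate total speaking time per speaker, in seconds.
total_speech = {}
for begin, end, speaker in parsed:
    total_speech[speaker] = total_speech.get(speaker, 0.0) + (end - begin)
```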
 
187
 
188
  ### Input
189
 
 
193
 
194
  ### Output
195
 
196
+ The output of the model is a T x S matrix, where:
197
  - S is the maximum number of speakers (in this model, S = 4).
198
  - T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.
199
  Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.
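The mapping from a matrix element to a time range can be sketched as follows. The 0.08-second frame length comes from the description above; the 0.5 threshold is an illustrative assumption, not the model's post-processing, and both function names are hypothetical.

```python
FRAME_SEC = 0.08  # each frame covers 0.08 seconds of audio

def frame_to_time_range(frame_index):
    """Return (start, end) in seconds for a frame, matching the a(150, 2) example."""
    start = round(frame_index * FRAME_SEC, 2)
    end = round((frame_index + 1) * FRAME_SEC, 2)
    return start, end

def active_speakers(frame_probs, threshold=0.5):
    """Return 1-based indices of speakers whose probability exceeds the threshold."""
    return [s + 1 for s, p in enumerate(frame_probs) if p > threshold]

# One row of the T x S matrix (S = 4) with made-up probabilities.
row_150 = [0.03, 0.95, 0.10, 0.01]
time_range = frame_to_time_range(150)
speakers = active_speakers(row_150)
```

Here `time_range` is the interval from 12.00 to 12.08 seconds and `speakers` contains only the second speaker.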
 
205
  Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90-second training samples and a batch size of 4.
206
  The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).
207
 
208
+ ### Evaluation
209
 
210
+ To evaluate the Sortformer diarizer and save diarization results in RTTM format, use the inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py):
211
+ ```shell
212
+ python [NEMO_GIT_FOLDER]/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
213
+ model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
214
+ manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
215
+ collar=COLLAR \
216
+ out_rttm_dir="/path/to/output_rttms"
217
+ ```
218
+
219
+ You can provide the post-processing YAML configs from the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset:
220
+ ```shell
221
+ python [NEMO_GIT_FOLDER]/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
222
+ model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
223
+ manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
224
+ collar=COLLAR \
225
+ bypass_postprocessing=False \
226
+ postprocessing_yaml="/path/to/postprocessing_config.yaml" \
227
+ out_rttm_dir="/path/to/output_rttms"
228
+ ```
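The RTTM files written to `out_rttm_dir` can be read back with a small parser. This sketch assumes the standard 10-field `SPEAKER` line layout of the RTTM format; `parse_rttm_line` is a hypothetical helper, and the example line (including its file ID and speaker label) is made up rather than actual script output.

```python
def parse_rttm_line(line):
    """Parse one RTTM SPEAKER line into a dict, or return None for other lines."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        return None
    onset = float(fields[3])     # turn onset, in seconds
    duration = float(fields[4])  # turn duration, in seconds
    return {
        "file_id": fields[1],
        "start": onset,
        "end": onset + duration,
        "speaker": fields[7],
    }

# Made-up RTTM line for illustration.
example = "SPEAKER multispeaker_audio1 1 12.00 3.52 <NA> <NA> speaker_1 <NA> <NA>"
turn = parse_rttm_line(example)
```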
229
 
230
  ### Technical Limitations
231