statsmaths committed on
Commit 215f3c6 · verified · 1 Parent(s): 4f59737

Upload 3 files

Files changed (3)
  1. LICENSE +21 -0
  2. README.md +152 -3
  3. config.yaml +19 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2023 CNRS
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,152 @@
- ---
- license: mit
- ---
+ ---
+ tags:
+ - pyannote
+ - pyannote-audio
+ - pyannote-audio-pipeline
+ - audio
+ - voice
+ - speech
+ - speaker
+ - speaker-diarization
+ - speaker-change-detection
+ - voice-activity-detection
+ - overlapped-speech-detection
+ - automatic-speech-recognition
+ license: mit
+ extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this pipeline uses the MIT license and will always remain open-source, we will occasionally email you about premium pipelines and paid services around pyannote."
+ extra_gated_fields:
+   Company/university: text
+   Website: text
+ ---
+
+ Using this open-source model in production?
+ Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.
+
+ # 🎹 Speaker diarization 3.0
+
+ This pipeline has been trained by Séverin Baroudi with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.0.0` using a combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.
+
+ It ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:
+
+ * stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
+ * audio files sampled at a different rate are resampled to 16kHz automatically upon loading (a sketch of this preprocessing is shown below).
+
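+ For reference, a minimal sketch of this preprocessing done by hand with `torchaudio` (the pipeline performs these steps for you; the file name `audio.wav` is only an example):
+
+ ```python
+ import torchaudio
+
+ waveform, sample_rate = torchaudio.load("audio.wav")
+
+ # downmix to mono by averaging the channels
+ waveform = waveform.mean(dim=0, keepdim=True)
+
+ # resample to 16 kHz if needed
+ if sample_rate != 16000:
+     waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
+     sample_rate = 16000
+ ```
+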
+ ## Requirements
+
+ 1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.0` with `pip install pyannote.audio`
+ 2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
+ 3. Accept [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) user conditions
+ 4. Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens) (one way to use it is sketched below).
+
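+ As an alternative to hard-coding the token in the Usage snippet below, a minimal sketch using the `huggingface_hub` login helper (an assumed workflow, not part of the original instructions):
+
+ ```python
+ # store the access token locally so that Pipeline.from_pretrained can pick it up
+ from huggingface_hub import login
+ login(token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
+ ```
+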
+ ## Usage
+
+ ```python
+ # instantiate the pipeline
+ from pyannote.audio import Pipeline
+ pipeline = Pipeline.from_pretrained(
+     "pyannote/speaker-diarization-3.0",
+     use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
+
+ # run the pipeline on an audio file
+ diarization = pipeline("audio.wav")
+
+ # dump the diarization output to disk using RTTM format
+ with open("audio.rttm", "w") as rttm:
+     diarization.write_rttm(rttm)
+ ```
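+
+ The returned `diarization` is a pyannote.core [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation); a minimal sketch of iterating over its speaker turns:
+
+ ```python
+ # print each speaker turn as (start, end, speaker label)
+ for turn, _, speaker in diarization.itertracks(yield_label=True):
+     print(f"{turn.start:.1f}s -- {turn.end:.1f}s: {speaker}")
+ ```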
+
+ ### Processing on GPU
+
+ `pyannote.audio` pipelines run on CPU by default.
+ You can send them to GPU with the following lines:
+
+ ```python
+ import torch
+ pipeline.to(torch.device("cuda"))
+ ```
+
+ Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).
+
+ In other words, it takes approximately 1.5 minutes to process a one-hour conversation.
+
+ ### Processing from memory
+
+ Pre-loading audio files in memory may result in faster processing:
+
+ ```python
+ import torchaudio
+ waveform, sample_rate = torchaudio.load("audio.wav")
+ diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
+ ```
+
+ ### Monitoring progress
+
+ Hooks are available to monitor the progress of the pipeline:
+
+ ```python
+ from pyannote.audio.pipelines.utils.hook import ProgressHook
+ with ProgressHook() as hook:
+     diarization = pipeline("audio.wav", hook=hook)
+ ```
+
+ ### Controlling the number of speakers
+
+ In case the number of speakers is known in advance, one can use the `num_speakers` option:
+
+ ```python
+ diarization = pipeline("audio.wav", num_speakers=2)
+ ```
+
+ One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:
+
+ ```python
+ diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
+ ```
+
+ ## Benchmark
+
+ This pipeline has been benchmarked on a large collection of datasets.
+
+ Processing is fully automatic:
+
+ * no manual voice activity detection (as is sometimes the case in the literature)
+ * no manual number of speakers (though it is possible to provide it to the pipeline)
+ * no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset
+
+ ... with the least forgiving diarization error rate (DER) setup (named *"Full"* in [this paper](https://doi.org/10.1016/j.csl.2021.101254)):
+
+ * no forgiveness collar
+ * evaluation of overlapped speech (a sketch of this evaluation setup is given after the table below)
+
+ | Benchmark | [DER%](. "Diarization error rate") | [FA%](. "False alarm rate") | [Miss%](. "Missed detection rate") | [Conf%](. "Speaker confusion rate") | Expected output | File-level evaluation |
+ | --------- | ---------------------------------- | --------------------------- | ---------------------------------- | ----------------------------------- | --------------- | --------------------- |
+ | [AISHELL-4](http://www.openslr.org/111/) | 12.3 | 3.8 | 4.4 | 4.1 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.eval) |
+ | [AliMeeting (*channel 1*)](https://www.openslr.org/119/) | 24.3 | 4.4 | 10.0 | 9.9 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.eval) |
+ | [AMI (*headset mix,*](https://groups.inf.ed.ac.uk/ami/corpus/) [*only_words*)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 19.0 | 3.6 | 9.5 | 5.9 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.eval) |
+ | [AMI (*array1, channel 1,*](https://groups.inf.ed.ac.uk/ami/corpus/) [*only_words*)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 22.2 | 3.8 | 11.2 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.eval) |
+ | [AVA-AVD](https://arxiv.org/abs/2111.14448) | 49.1 | 10.8 | 15.7 | 22.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.eval) |
+ | [DIHARD 3 (*Full*)](https://arxiv.org/abs/2012.01477) | 21.7 | 6.2 | 8.1 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.eval) |
+ | [MSDWild](https://x-lance.github.io/MSDWILD/) | 24.6 | 5.8 | 8.0 | 10.7 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.eval) |
+ | [REPERE (*phase 2*)](https://islrn.org/resources/360-758-359-485-0/) | 7.8 | 1.8 | 2.6 | 3.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.eval) |
+ | [VoxConverse (*v0.3*)](https://github.com/joonson/voxconverse) | 11.3 | 4.1 | 3.4 | 3.8 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.eval) |
+
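+ For reference, a minimal sketch of how a DER with this setup (no collar, overlapped speech included) can be computed with `pyannote.metrics` (assuming a reference annotation is available as `reference.rttm` for a file with URI `audio`; both names are just examples):
+
+ ```python
+ from pyannote.database.util import load_rttm
+ from pyannote.metrics.diarization import DiarizationErrorRate
+
+ # load the manual reference annotation for this file
+ reference = load_rttm("reference.rttm")["audio"]
+
+ # collar=0.0 and skip_overlap=False are the defaults, i.e. the "Full" setup
+ metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
+ print(f"DER = {metric(reference, diarization):.1%}")
+ ```
+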
+ ## Citations
+
+ ```bibtex
+ @inproceedings{Plaquet23,
+   author={Alexis Plaquet and Hervé Bredin},
+   title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
+   year=2023,
+   booktitle={Proc. INTERSPEECH 2023},
+ }
+ ```
+
+ ```bibtex
+ @inproceedings{Bredin23,
+   author={Hervé Bredin},
+   title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
+   year=2023,
+   booktitle={Proc. INTERSPEECH 2023},
+ }
+ ```
config.yaml ADDED
@@ -0,0 +1,19 @@
+ version: 3.0.0
+
+ pipeline:
+   name: pyannote.audio.pipelines.SpeakerDiarization
+   params:
+     clustering: AgglomerativeClustering
+     embedding: hbredin/wespeaker-voxceleb-resnet34-LM
+     embedding_batch_size: 1
+     embedding_exclude_overlap: true
+     segmentation: pyannote/segmentation-3.0
+     segmentation_batch_size: 32
+
+ params:
+   clustering:
+     method: centroid
+     min_cluster_size: 12
+     threshold: 0.7045654963945799
+   segmentation:
+     min_duration_off: 0.0