statsmaths committed on
Commit 215f3c6 · verified · 1 Parent(s): 4f59737

Upload 3 files

Files changed (3)
  1. LICENSE +21 -0
  2. README.md +152 -3
  3. config.yaml +19 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2023 CNRS
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,152 @@
- ---
- license: mit
- ---
+ ---
+ tags:
+ - pyannote
+ - pyannote-audio
+ - pyannote-audio-pipeline
+ - audio
+ - voice
+ - speech
+ - speaker
+ - speaker-diarization
+ - speaker-change-detection
+ - voice-activity-detection
+ - overlapped-speech-detection
+ - automatic-speech-recognition
+ license: mit
+ extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this pipeline uses the MIT license and will always remain open-source, we will occasionally email you about premium pipelines and paid services around pyannote."
+ extra_gated_fields:
+   Company/university: text
+   Website: text
+ ---
+
+ Using this open-source model in production?
+ Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.
+
+ # 🎹 Speaker diarization 3.0
+
+ This pipeline has been trained by Séverin Baroudi with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.0.0` using a combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.
+
+ It ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:
+
+ * stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
+ * audio files sampled at a different rate are resampled to 16kHz automatically upon loading (a sketch of this preprocessing is shown below).
+
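+ For reference, a minimal sketch of this preprocessing done by hand with `torchaudio` (the pipeline performs these steps for you; the file name `audio.wav` is only an example):
+
+ ```python
+ import torchaudio
+
+ waveform, sample_rate = torchaudio.load("audio.wav")
+
+ # downmix to mono by averaging the channels
+ waveform = waveform.mean(dim=0, keepdim=True)
+
+ # resample to 16 kHz if needed
+ if sample_rate != 16000:
+     waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
+     sample_rate = 16000
+ ```
+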
+ ## Requirements
+
+ 1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.0` with `pip install pyannote.audio`
+ 2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
+ 3. Accept [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) user conditions
+ 4. Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens) (one way to use it is sketched below).
+
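+ As an alternative to hard-coding the token in the Usage snippet below, a minimal sketch using the `huggingface_hub` login helper (an assumed workflow, not part of the original instructions):
+
+ ```python
+ # store the access token locally so that Pipeline.from_pretrained can pick it up
+ from huggingface_hub import login
+ login(token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
+ ```
+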
+ ## Usage
+
+ ```python
+ # instantiate the pipeline
+ from pyannote.audio import Pipeline
+ pipeline = Pipeline.from_pretrained(
+     "pyannote/speaker-diarization-3.0",
+     use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
+
+ # run the pipeline on an audio file
+ diarization = pipeline("audio.wav")
+
+ # dump the diarization output to disk using RTTM format
+ with open("audio.rttm", "w") as rttm:
+     diarization.write_rttm(rttm)
+ ```
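+
+ The returned `diarization` is a pyannote.core [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation); a minimal sketch of iterating over its speaker turns:
+
+ ```python
+ # print each speaker turn as (start, end, speaker label)
+ for turn, _, speaker in diarization.itertracks(yield_label=True):
+     print(f"{turn.start:.1f}s -- {turn.end:.1f}s: {speaker}")
+ ```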
+
+ ### Processing on GPU
+
+ `pyannote.audio` pipelines run on CPU by default.
+ You can send them to GPU with the following lines:
+
+ ```python
+ import torch
+ pipeline.to(torch.device("cuda"))
+ ```
+
+ Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).
+
+ In other words, it takes approximately 1.5 minutes to process a one-hour conversation.
+
+ ### Processing from memory
+
+ Pre-loading audio files in memory may result in faster processing:
+
+ ```python
+ import torchaudio
+ waveform, sample_rate = torchaudio.load("audio.wav")
+ diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
+ ```
+
+ ### Monitoring progress
+
+ Hooks are available to monitor the progress of the pipeline:
+
+ ```python
+ from pyannote.audio.pipelines.utils.hook import ProgressHook
+ with ProgressHook() as hook:
+     diarization = pipeline("audio.wav", hook=hook)
+ ```
+
+ ### Controlling the number of speakers
+
+ In case the number of speakers is known in advance, one can use the `num_speakers` option:
+
+ ```python
+ diarization = pipeline("audio.wav", num_speakers=2)
+ ```
+
+ One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:
+
+ ```python
+ diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
+ ```
+
+ ## Benchmark
+
+ This pipeline has been benchmarked on a large collection of datasets.
+
+ Processing is fully automatic:
+
+ * no manual voice activity detection (as is sometimes the case in the literature)
+ * no manual number of speakers (though it is possible to provide it to the pipeline)
+ * no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset
+
+ ... with the least forgiving diarization error rate (DER) setup (named *"Full"* in [this paper](https://doi.org/10.1016/j.csl.2021.101254)):
+
+ * no forgiveness collar
+ * evaluation of overlapped speech (a sketch of this evaluation setup is given after the table below)
+
+ | Benchmark | [DER%](. "Diarization error rate") | [FA%](. "False alarm rate") | [Miss%](. "Missed detection rate") | [Conf%](. "Speaker confusion rate") | Expected output | File-level evaluation |
+ | --------- | ---------------------------------- | --------------------------- | ---------------------------------- | ----------------------------------- | --------------- | --------------------- |
+ | [AISHELL-4](http://www.openslr.org/111/) | 12.3 | 3.8 | 4.4 | 4.1 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.eval) |
+ | [AliMeeting (*channel 1*)](https://www.openslr.org/119/) | 24.3 | 4.4 | 10.0 | 9.9 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.eval) |
+ | [AMI (*headset mix,*](https://groups.inf.ed.ac.uk/ami/corpus/) [*only_words*)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 19.0 | 3.6 | 9.5 | 5.9 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.eval) |
+ | [AMI (*array1, channel 1,*](https://groups.inf.ed.ac.uk/ami/corpus/) [*only_words*)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 22.2 | 3.8 | 11.2 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.eval) |
+ | [AVA-AVD](https://arxiv.org/abs/2111.14448) | 49.1 | 10.8 | 15.7 | 22.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.eval) |
+ | [DIHARD 3 (*Full*)](https://arxiv.org/abs/2012.01477) | 21.7 | 6.2 | 8.1 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.eval) |
+ | [MSDWild](https://x-lance.github.io/MSDWILD/) | 24.6 | 5.8 | 8.0 | 10.7 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.eval) |
+ | [REPERE (*phase 2*)](https://islrn.org/resources/360-758-359-485-0/) | 7.8 | 1.8 | 2.6 | 3.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.eval) |
+ | [VoxConverse (*v0.3*)](https://github.com/joonson/voxconverse) | 11.3 | 4.1 | 3.4 | 3.8 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.eval) |
+
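+ For reference, a minimal sketch of how a DER with this setup (no collar, overlapped speech included) can be computed with `pyannote.metrics` (assuming a reference annotation is available as `reference.rttm` for a file with URI `audio`; both names are just examples):
+
+ ```python
+ from pyannote.database.util import load_rttm
+ from pyannote.metrics.diarization import DiarizationErrorRate
+
+ # load the manual reference annotation for this file
+ reference = load_rttm("reference.rttm")["audio"]
+
+ # collar=0.0 and skip_overlap=False are the defaults, i.e. the "Full" setup
+ metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
+ print(f"DER = {metric(reference, diarization):.1%}")
+ ```
+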
+ ## Citations
+
+ ```bibtex
+ @inproceedings{Plaquet23,
+   author={Alexis Plaquet and Hervé Bredin},
+   title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
+   year=2023,
+   booktitle={Proc. INTERSPEECH 2023},
+ }
+ ```
+
+ ```bibtex
+ @inproceedings{Bredin23,
+   author={Hervé Bredin},
+   title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
+   year=2023,
+   booktitle={Proc. INTERSPEECH 2023},
+ }
+ ```
config.yaml ADDED
@@ -0,0 +1,19 @@
+ version: 3.0.0
+
+ pipeline:
+   name: pyannote.audio.pipelines.SpeakerDiarization
+   params:
+     clustering: AgglomerativeClustering
+     embedding: hbredin/wespeaker-voxceleb-resnet34-LM
+     embedding_batch_size: 1
+     embedding_exclude_overlap: true
+     segmentation: pyannote/segmentation-3.0
+     segmentation_batch_size: 32
+
+ params:
+   clustering:
+     method: centroid
+     min_cluster_size: 12
+     threshold: 0.7045654963945799
+   segmentation:
+     min_duration_off: 0.0