Update README.md
# Non-timbral Embeddings extractor

This model produces embeddings that globally represent the non-timbral traits (prosody, accent, ...) of a speaker's voice. These embeddings can be used in the same way as classical speaker verification (ASV) embeddings: to compare two voice signals, an embedding vector is computed for each of them, and the cosine similarity between the two embeddings serves as the comparison score. The main difference with classical ASV embeddings is that here only the non-timbral traits are compared.

The model has been derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large).

The next section explains how to compute these non-timbral embeddings.
# Usage

The following code snippet uses the file [spk_embeddings.py](https://huggingface.co/Orange/w-pro/blob/main/spk_embeddings.py) to build the architecture of the model. Its weights are then downloaded from this repository. The snippet sketches a typical comparison of two voice signals; the exact call signatures are the ones defined in spk_embeddings.py, and the file names are placeholders:

```
from spk_embeddings import EmbeddingsModel, compute_embedding
import torch

# build the architecture and download the weights from this repository
model = EmbeddingsModel.from_pretrained("Orange/w-pro")
model.eval()

# compute one embedding vector per audio file
e1 = compute_embedding("voice1.wav", model)
e2 = compute_embedding("voice2.wav", model)

# cosine similarity between the two embedding vectors
sim = float(torch.matmul(e1, e2.t()))
```
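In a verification setting, this similarity can then be compared to a decision threshold. A minimal sketch, using the threshold calibrated on VoxCeleb1 that is reported in the Evaluations section below:

```
# decide whether the two files come from the same speaker; 0.467 is the
# decision threshold obtained on VoxCeleb1-clean (see Evaluations below)
THRESHOLD = 0.467
same_speaker = sim >= THRESHOLD
```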
# Evaluations

Although it is not directly designed for this use case, evaluation on a standard ASV task can be performed with this model. Applied to the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt), it leads to an equal error rate (EER, where a lower value denotes better identification and random prediction yields 50%) of **10.681%**, with a decision threshold of **0.467**. This value can be interpreted as the ability to identify speakers from non-timbral cues only. A discussion of this interpretation can be found in the paper mentioned hereafter, as well as other experiments showing correlations between these embeddings and non-timbral voice attributes.
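For reference, the EER is the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be computed from trial scores and same/different-speaker labels (numpy-based; this helper is ours, not part of this repository):

```
import numpy as np

def compute_eer(scores, labels):
    # scores: cosine similarities for all trials
    # labels: 1 for same-speaker trials, 0 for different-speaker trials
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)                  # candidate thresholds, ascending
    sorted_labels = labels[order]
    n_pos = sorted_labels.sum()
    n_neg = len(sorted_labels) - n_pos
    # accepting trials whose score is strictly above the i-th threshold:
    frr = np.cumsum(sorted_labels) / n_pos             # same-speaker rejected
    far = 1.0 - np.cumsum(1 - sorted_labels) / n_neg   # different-speaker accepted
    i = np.argmin(np.abs(far - frr))            # point where the two rates meet
    return (far[i] + frr[i]) / 2.0, scores[order][i]
```

Applied to the scores of all trials in veri_test2.txt, such a computation produces both the EER and the associated decision threshold.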
Please note that the EER value can vary slightly depending on the max_size used to shorten long audio files (at most 30 seconds in our case).
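For illustration, assuming max_size is expressed in seconds and the audio is sampled at the 16 kHz rate expected by WavLM-based models, the shortening step amounts to:

```
import torchaudio

MAX_SIZE_S = 30          # max_size used in our experiments
SAMPLE_RATE = 16000      # sampling rate expected by WavLM-based models

wav, sr = torchaudio.load("voice1.wav")
assert sr == SAMPLE_RATE, "resample the audio to 16 kHz first"
wav = wav[:, : MAX_SIZE_S * SAMPLE_RATE]   # keep at most the first 30 seconds
```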