ggmbr committed
Commit 8e1b907 · verified · 1 Parent(s): 3cb5c96

Update README.md

Files changed (1): README.md (+6 −4)
README.md CHANGED
@@ -15,7 +15,7 @@ datasets:
 
 # Non-timbral Embeddings extractor
 This model produces embeddings that globally represent the non-timbral traits (prosody, accent, ...) of a speaker's voice. These embeddings can be used the same way as for a classical
- speaker verification (ASV): to compare two voice signals, extract an embeddings for each of them and compute the cosine similarity between the two embeddings.
+ speaker verification (ASV): to compare two voice signals, an embedding vector must be computed for each of them; the cosine similarity between the two embeddings can then be used for comparison.
 The main difference with classical ASV embeddings is that here only the non-timbral traits are compared.
 
 The model has been derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large).
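
The comparison described in the changed line is a plain cosine-similarity check between two embedding vectors. A minimal sketch, assuming the two embeddings are already available as 1-D torch tensors (all names below are illustrative, not taken from the repository):

```
import torch
import torch.nn.functional as F

def cosine_score(e1: torch.Tensor, e2: torch.Tensor) -> float:
    # Normalize defensively, then take the dot product: this equals
    # the cosine similarity and lies in [-1, 1].
    e1 = F.normalize(e1, dim=0)
    e2 = F.normalize(e2, dim=0)
    return float(torch.dot(e1, e2))

# Toy usage with random vectors standing in for real embeddings:
score = cosine_score(torch.randn(256), torch.randn(256))
```

A higher score means the two voices are closer in the non-timbral embedding space.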
@@ -25,7 +25,7 @@ The next section explains how to compute these non-timbral embeddings.
 
 # Usage
 The following code snippet uses the file [spk_embeddings.py](https://huggingface.co/Orange/w-pro/blob/main/spk_embeddings.py)
- to build the model architecture.
+ to build the architecture of the model.
 Its weights are then downloaded from this repository.
 ```
 from spk_embeddings import EmbeddingsModel, compute_embedding
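
The hunk stops at the first line of the snippet. Below is a hedged sketch of how the imported pieces could fit together; only `EmbeddingsModel`, `compute_embedding`, and the `sim = float(torch.matmul(e1,e2.t()))` context line are confirmed by the diff, while the `from_pretrained` call and the `compute_embedding` arguments are assumptions (see spk_embeddings.py for the actual API):

```
import torch
from spk_embeddings import EmbeddingsModel, compute_embedding

# Assumption: the model can be instantiated with the weights of this
# repository via a from_pretrained-style call.
model = EmbeddingsModel.from_pretrained("Orange/w-pro")
model.eval()

# Assumption: compute_embedding takes the model and a wav file path
# and returns a (1, D) embedding tensor.
e1 = compute_embedding(model, "speech_a.wav")
e2 = compute_embedding(model, "speech_b.wav")

# Confirmed by the hunk header context: cosine similarity via matmul
# (valid when the embeddings are L2-normalized).
sim = float(torch.matmul(e1, e2.t()))
```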
@@ -54,8 +54,10 @@ sim = float(torch.matmul(e1,e2.t()))
 
 # Evaluations
 Although it is not directly designed for this use case, evaluation on a standard ASV task can be performed with this model. Applied to
- the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt), it leads to an equal error rate (EER, lower value denotes a better identification, random prediction leads to a value of 50%) of **10.681%**
- (with a decision threshold of **0.467**). This value can be interpreted as the ability to identify speakers only with non-timbral cues. A discussion about this interpretation can be
+ the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt), it leads to an equal error rate
+ (EER, where a lower value denotes better identification; random prediction leads to a value of 50%) of **10.681%**
+ (with a decision threshold of **0.467**).
+ This value can be interpreted as the ability to identify speakers only with non-timbral cues. A discussion about this interpretation can be
 found in the paper mentioned hereafter, as well as other experiments showing correlations between these embeddings and non-timbral voice attributes.
 
 Please note that the EER value can vary a little depending on the max_size defined to reduce long audios (max 30 seconds in our case).
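
For context on the figure above: the EER is the operating point where the false-acceptance and false-rejection rates are equal. A generic sketch of how it can be computed from verification trial scores, not the authors' evaluation code (assumes scikit-learn is available):

```
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 for same-speaker trials, 0 for different-speaker trials
    # scores: cosine similarities, e.g. `sim` from the usage snippet
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = int(np.nanargmin(np.abs(fnr - fpr)))  # point where FAR ~= FRR
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return float(eer), float(thresholds[idx])
```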
 