ggmbr committed
Commit 8e1b907 · verified · 1 Parent(s): 3cb5c96

Update README.md

Files changed (1): README.md (+6 −4)
README.md CHANGED
@@ -15,7 +15,7 @@ datasets:
 
 # Non-timbral Embeddings extractor
 This model produces embeddings that globally represent the non-timbral traits (prosody, accent, ...) of a speaker's voice. These embeddings can be used the same way as for a classical
- speaker verification (ASV): to compare two voice signals, extract an embeddings for each of them and compute the cosine similarity between the two embeddings.
+ speaker verification (ASV): to compare two voice signals, an embedding vector must be computed for each of them; the cosine similarity between the two embeddings can then be used for comparison.
 The main difference with classical ASV embeddings is that here only the non-timbral traits are compared.
 
 The model has been derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large).
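
The comparison described in the changed line is a plain cosine-similarity check between two embedding vectors. A minimal sketch, assuming the two embeddings are already available as 1-D torch tensors (all names below are illustrative, not taken from the repository):

```
import torch
import torch.nn.functional as F

def cosine_score(e1: torch.Tensor, e2: torch.Tensor) -> float:
    # Normalize defensively, then take the dot product: this equals
    # the cosine similarity and lies in [-1, 1].
    e1 = F.normalize(e1, dim=0)
    e2 = F.normalize(e2, dim=0)
    return float(torch.dot(e1, e2))

# Toy usage with random vectors standing in for real embeddings:
score = cosine_score(torch.randn(256), torch.randn(256))
```

A higher score means the two voices are closer in the non-timbral embedding space.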
@@ -25,7 +25,7 @@ The next section explains how to compute these non-timbral embeddings.
 
 # Usage
 The following code snippet uses the file [spk_embeddings.py](https://huggingface.co/Orange/w-pro/blob/main/spk_embeddings.py)
- to build the model architecture.
+ to build the architecture of the model.
 Its weights are then downloaded from this repository.
 ```
 from spk_embeddings import EmbeddingsModel, compute_embedding
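
The hunk stops at the first line of the snippet. Below is a hedged sketch of how the imported pieces could fit together; only `EmbeddingsModel`, `compute_embedding`, and the `sim = float(torch.matmul(e1,e2.t()))` context line are confirmed by the diff, while the `from_pretrained` call and the `compute_embedding` arguments are assumptions (see spk_embeddings.py for the actual API):

```
import torch
from spk_embeddings import EmbeddingsModel, compute_embedding

# Assumption: the model can be instantiated with the weights of this
# repository via a from_pretrained-style call.
model = EmbeddingsModel.from_pretrained("Orange/w-pro")
model.eval()

# Assumption: compute_embedding takes the model and a wav file path
# and returns a (1, D) embedding tensor.
e1 = compute_embedding(model, "speech_a.wav")
e2 = compute_embedding(model, "speech_b.wav")

# Confirmed by the hunk header context: cosine similarity via matmul
# (valid when the embeddings are L2-normalized).
sim = float(torch.matmul(e1, e2.t()))
```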
@@ -54,8 +54,10 @@ sim = float(torch.matmul(e1,e2.t()))
 
 # Evaluations
 Although it is not directly designed for this use case, evaluation on a standard ASV task can be performed with this model. Applied to
- the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt), it leads to an equal error rate (EER, lower value denotes a better identification, random prediction leads to a value of 50%) of **10.681%**
- (with a decision threshold of **0.467**). This value can be interpreted as the ability to identify speakers only with non-timbral cues. A discussion about this interpretation can be
+ the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt), it leads to an equal error rate
+ (EER, where a lower value denotes better identification; random prediction leads to a value of 50%) of **10.681%**
+ (with a decision threshold of **0.467**).
+ This value can be interpreted as the ability to identify speakers only with non-timbral cues. A discussion about this interpretation can be
 found in the paper mentioned hereafter, as well as other experiments showing correlations between these embeddings and non-timbral voice attributes.
 
 Please note that the EER value can vary a little depending on the max_size defined to reduce long audios (max 30 seconds in our case).
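
For context on the figure above: the EER is the operating point where the false-acceptance and false-rejection rates are equal. A generic sketch of how it can be computed from verification trial scores, not the authors' evaluation code (assumes scikit-learn is available):

```
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 for same-speaker trials, 0 for different-speaker trials
    # scores: cosine similarities, e.g. `sim` from the usage snippet
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = int(np.nanargmin(np.abs(fnr - fpr)))  # point where FAR ~= FRR
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return float(eer), float(thresholds[idx])
```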
 