# wavlm_ssl_sv
This repository contains the source code for the article *Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models* (INTERSPEECH 2024) [arXiv].
The proposed framework fine-tunes a pre-trained WavLM using pseudo-labels, generated through Self-Supervised Learning (SSL), for Speaker Verification (SV). Initial pseudo-labels are derived from an SSL DINO-based model and are iteratively refined by clustering the model embeddings.
Our method achieves 0.99% EER on VoxCeleb1-O, establishing a new state of the art for Speaker Verification with SSL.
Please refer to the article for more details on the implementation and a comparative study with other works.
## Usage

### Installation
- Install dependencies with `pip install -r requirements.txt`.
- Prepare data for the VoxCeleb, MUSAN, and RIR datasets following voxceleb_trainer.
- Download the WavLM-Base+ model and place `WavLM-Base+.pt` in the root folder (a quick load check is sketched below).
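A quick way to verify the checkpoint is readable before training; the top-level keys printed here are an assumption based on the official WavLM release, so adjust if your download differs:

```python
import torch

# Load the checkpoint on CPU just to confirm the file is intact.
ckpt = torch.load("WavLM-Base+.pt", map_location="cpu")
print(list(ckpt.keys()))  # typically contains "cfg" and "model" (assumption)
```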
### Training

#### Step 1: Extract DINO speaker embeddings
The code to train the DINO model is not currently provided. We recommend using sslsv or 3D-Speaker to extract initial speaker embeddings.
Alternatively, you can directly download the DINO embeddings we used for our system: dino_vox2_embeddings.pt.
Note: the embeddings file must be a `Dict[str, torch.Tensor]` covering all VoxCeleb2 samples, with keys of the form `id00012/21Uxsk56VDQ/00001.wav`.
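For illustration, a minimal sketch of producing and validating a file in this format; the embedding dimension (512) and the single key shown are placeholder assumptions:

```python
import torch

# Hypothetical example entry; in practice there is one entry per
# VoxCeleb2 utterance, keyed by "speaker_id/video_id/file.wav".
embeddings = {
    "id00012/21Uxsk56VDQ/00001.wav": torch.randn(512),  # 512 dims is an assumption
}
torch.save(embeddings, "dino_vox2_embeddings.pt")

# Validate the expected Dict[str, torch.Tensor] structure before Step 2.
loaded = torch.load("dino_vox2_embeddings.pt")
assert all(isinstance(k, str) and isinstance(v, torch.Tensor)
           for k, v in loaded.items())
```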
#### Step 2: Generate pseudo-labels

```
python pseudo_labeling.py PATH_TO_EMBEDDINGS_FILE PATH_TO_PL_FILE
```
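Conceptually, this step clusters the embeddings into pseudo-speaker classes. A minimal k-means sketch of the idea, assuming a `label path` output layout and a placeholder cluster count (the actual `pseudo_labeling.py` may differ on both):

```python
import torch
from sklearn.cluster import KMeans

embeddings = torch.load("dino_vox2_embeddings.pt")  # Dict[str, torch.Tensor]
keys = sorted(embeddings)
X = torch.stack([embeddings[k] for k in keys]).numpy()

# Cluster count is an assumption; the paper's setting may differ.
labels = KMeans(n_clusters=7500).fit_predict(X)

# Write "pseudo_speaker_id path" lines, mirroring voxceleb_trainer's
# train_list layout (an assumption about PATH_TO_PL_FILE's format).
with open("pseudo_labels.txt", "w") as f:
    for key, label in zip(keys, labels):
        f.write(f"{label} {key}\n")
```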
#### Step 3: Fine-tune WavLM MHFA

```
python trainSpeakerNet.py --config configs/wavlm_mhfa_dlg_lc.yaml --train_list PATH_TO_PL_FILE --distributed
```
#### Iterative process
Extract embeddings from the WavLM MHFA model:

```
python trainSpeakerNet_Eval.py --config configs/wavlm_mhfa_dlg_lc.yaml --generate_embeddings --embeddings_path PATH_TO_EMBEDDINGS_FILE
```

Repeat steps 2 and 3, making sure to change `save_path` in the config to avoid overwriting the existing model (a scripted version of this loop is sketched below).
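A hedged sketch of scripting the refinement loop with `subprocess`; the iteration count and intermediate file names are assumptions, and `save_path` still has to be edited in the config between rounds:

```python
import subprocess

emb_file = "PATH_TO_EMBEDDINGS_FILE"  # starts as the DINO embeddings
for it in range(2):  # number of refinement iterations is an assumption
    pl_file = f"pseudo_labels_iter{it}.txt"
    subprocess.run(["python", "pseudo_labeling.py", emb_file, pl_file],
                   check=True)
    # NOTE: change save_path in the config before this call so the
    # previous model is not overwritten.
    subprocess.run(["python", "trainSpeakerNet.py",
                    "--config", "configs/wavlm_mhfa_dlg_lc.yaml",
                    "--train_list", pl_file, "--distributed"], check=True)
    emb_file = f"embeddings_iter{it}.pt"
    subprocess.run(["python", "trainSpeakerNet_Eval.py",
                    "--config", "configs/wavlm_mhfa_dlg_lc.yaml",
                    "--generate_embeddings",
                    "--embeddings_path", emb_file], check=True)
```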
#### Step 4: Large-Margin Fine-Tuning
Copy the latest model checkpoint to `exp/wavlm_mhfa_dlg_lc_lmft/model` so that training resumes from it, then start training:

```
python trainSpeakerNet.py --config configs/wavlm_mhfa_dlg_lc_lmft.yaml --train_list PATH_TO_PL_FILE --distributed
```
### Evaluation

```
python trainSpeakerNet_Eval.py --config configs/wavlm_mhfa_dlg_lc_lmft.yaml --eval
```
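For reference, the EER reported below is the operating point where the false-acceptance and false-rejection rates are equal. A standalone sketch, independent of the repository's own evaluation code:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal Error Rate: the point where FPR equals FNR (miss rate)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Toy usage with made-up trial labels (1 = same speaker) and scores.
print(compute_eer([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.1]))  # -> 0.0
```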
## Model weights

The checkpoint of our best model, reaching 0.99% EER on VoxCeleb1-O, is available for download: wavlm_mhfa_dlg_lc_lmft.
## Acknowledgements
This repository contains third-party components and code adapted from other open-source projects, including: SLT22_MultiHead-Factorized-Attentive-Pooling and Loss-Gated-Learning.
## Citation
If you use this project, please consider starring this repository on GitHub and citing the following paper:

```
@InProceedings{miara2024WavLMSSLSV,
  author    = {Miara, Victor and Lepage, Théo and Dehak, Réda},
  booktitle = {INTERSPEECH},
  title     = {Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models},
  year      = {2024},
  url       = {https://arxiv.org/abs/2406.02285},
}
```
## License
This project is released under the MIT License.