|
--- |
|
license: apache-2.0 |
|
tags: |
|
- audio |
|
- speech |
|
- speaker |
|
- speaker-recognition |
|
- speaker-embedding |
|
- speaker-verification |
|
- speaker-identification |
|
- speaker-encoder |
|
- tflite |
|
- voice |
|
library_name: sidlingvo |
|
--- |
|
|
|
# Conformer based multilingual speaker encoder |
|
|
|
## Summary |
|
|
|
This is a massively multilingual conformer-based speaker recognition model. |
|
|
|
The model was trained with public data only, using the GE2E loss. |
|
|
|
Papers: |
|
|
|
* Multilingual: https://arxiv.org/abs/2104.02125 |
|
* GE2E loss: https://arxiv.org/abs/1710.10467 |
|
|
|
```bibtex
|
@inproceedings{chojnacka2021speakerstew, |
|
title={{SpeakerStew: Scaling to many languages with a triaged multilingual text-dependent and text-independent speaker verification system}}, |
|
author={Chojnacka, Roza and Pelecanos, Jason and Wang, Quan and Moreno, Ignacio Lopez}, |
|
booktitle={Proc. Interspeech},
|
pages={1064--1068}, |
|
year={2021}, |
|
doi={10.21437/Interspeech.2021-646}, |
|
issn={2958-1796}, |
|
} |
|
|
|
@inproceedings{wan2018generalized, |
|
title={Generalized end-to-end loss for speaker verification}, |
|
author={Wan, Li and Wang, Quan and Papir, Alan and Moreno, Ignacio Lopez}, |
|
booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, |
|
pages={4879--4883}, |
|
year={2018}, |
|
organization={IEEE} |
|
} |
|
``` |
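For intuition, the GE2E softmax loss described in the second paper can be sketched in NumPy roughly as follows. This is a minimal illustration, not this model's training code: in real training the similarity scale `w` and bias `b` are learned parameters, and the loss is computed on batches of network outputs rather than fixed arrays.

```python
import numpy as np

def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
    """GE2E softmax loss sketch (Wan et al., 2018).

    embeddings: array of shape (n_speakers, n_utterances, dim),
    L2-normalized per utterance. w and b are illustrative constants;
    they are trainable in the actual loss.
    """
    n_spk, n_utt, _ = embeddings.shape
    centroids = embeddings.mean(axis=1)  # one centroid per speaker
    loss = 0.0
    for j in range(n_spk):
        for i in range(n_utt):
            e = embeddings[j, i]
            sims = np.empty(n_spk)
            for k in range(n_spk):
                if k == j:
                    # Exclude the utterance itself from its own centroid
                    # to avoid a trivial solution.
                    c = (centroids[j] * n_utt - e) / (n_utt - 1)
                else:
                    c = centroids[k]
                cos = c @ e / (np.linalg.norm(c) * np.linalg.norm(e))
                sims[k] = w * cos + b
            # Softmax loss: pull the utterance toward its own speaker's
            # centroid, push it away from all other centroids.
            loss += -sims[j] + np.log(np.exp(sims).sum())
    return loss / (n_spk * n_utt)
```

Each per-utterance term is non-negative, since the log-sum-exp over all centroids is at least the similarity to the utterance's own centroid.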
|
|
|
## Usage |
|
|
|
To use this model, you will need the `sidlingvo` library: https://github.com/google/speaker-id/tree/master/lingvo
|
|
|
Since lingvo does not support Python 3.11 yet, make sure your Python version is 3.10 or earlier.
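A quick way to check this before installing is a version guard like the one below; `lingvo_python_ok` is a hypothetical helper for illustration, not part of `sidlingvo`.

```python
import sys

def lingvo_python_ok(version_info=sys.version_info):
    # Hypothetical check: per this card, lingvo supports
    # Python up to (and including) 3.10 only.
    return tuple(version_info[:2]) <= (3, 10)

print(lingvo_python_ok())
```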
|
|
|
Install the library: |
|
|
|
```shell
|
pip install sidlingvo |
|
``` |
|
|
|
Example usage: |
|
|
|
```python
|
import os |
|
from sidlingvo import wav_to_dvector |
|
from huggingface_hub import hf_hub_download |
|
|
|
repo_id = "tflite-hub/conformer-speaker-encoder" |
|
model_path = "models" |
|
hf_hub_download(repo_id=repo_id, filename="vad_long_model.tflite", local_dir=model_path) |
|
hf_hub_download(repo_id=repo_id, filename="vad_long_mean_stddev.csv", local_dir=model_path) |
|
hf_hub_download(repo_id=repo_id, filename="conformer_tisid_medium.tflite", local_dir=model_path) |
|
|
|
enroll_wav_files = ["your_first_wav_file.wav"] |
|
test_wav_file = "your_second_wav_file.wav" |
|
runner = wav_to_dvector.WavToDvectorRunner( |
|
vad_model_file=os.path.join(model_path, "vad_long_model.tflite"), |
|
vad_mean_stddev_file=os.path.join(model_path, "vad_long_mean_stddev.csv"), |
|
tisid_model_file=os.path.join(model_path, "conformer_tisid_medium.tflite")) |
|
score = runner.compute_score(enroll_wav_files, test_wav_file) |
|
print("Speaker similarity score:", score) |
|
``` |
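`compute_score` returns a single speaker-similarity score: higher means the test audio is more likely from the enrolled speaker. Assuming a cosine-style similarity (an assumption, since the score range is not documented here), a verification decision might be sketched as:

```python
def accept_same_speaker(score, threshold=0.6):
    # threshold is a hypothetical value for illustration; tune it on
    # held-out enrollment/test pairs for your target false-accept and
    # false-reject rates.
    return score >= threshold
```

In practice the threshold trades off false accepts against false rejects, so it should be calibrated on data matching your deployment conditions.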