---
license: mit
base_model: microsoft/speecht5_tts
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof
  results: []
datasets:
- galsenai/wolof_tts
language:
- wo
pipeline_tag: text-to-speech
---
speecht5_tts-wolof
This model is a fine-tuned version of SpeechT5 for Text-to-Speech (TTS) on a Wolof dataset. It uses a custom tokenizer designed for Wolof and adjusts the baseline model's configuration to account for the new vocabulary introduced by the custom tokenizer. This version of SpeechT5 provides speech synthesis capabilities specifically tuned for the Wolof language.
Model description
This model is based on the SpeechT5 architecture, which integrates both speech recognition and synthesis into a unified framework. It is fine-tuned for Text-to-Speech (TTS) using a custom-trained tokenizer and an adapted configuration that accounts for the unique vocabulary of the Wolof language. The fine-tuning process was carried out using a dataset containing text in Wolof to help the model synthesize speech that captures the nuances of the language.
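Because the custom tokenizer changes the vocabulary, the size of the text embedding table in the model configuration has to agree with the tokenizer. The following is a minimal sketch of how one might check this for a loaded checkpoint. `check_vocab_alignment` and `inspect_checkpoint` are hypothetical helper names, not part of Transformers, and the sketch assumes that `vocab_size` in the SpeechT5 configuration describes the text embedding table.

```python
def check_vocab_alignment(tokenizer_size: int, config_vocab_size: int) -> str:
    """Compare tokenizer size with the configured embedding size (hypothetical helper)."""
    if tokenizer_size == config_vocab_size:
        return "aligned"
    if tokenizer_size < config_vocab_size:
        # Extra embedding rows are never looked up, so this case is harmless.
        return "embedding table larger than tokenizer"
    # Some token ids would fall outside the embedding table.
    return "tokenizer larger than embedding table"


def inspect_checkpoint(checkpoint: str = "bilalfaye/speecht5_tts-wolof") -> str:
    """Download the checkpoint and report its alignment (requires network access)."""
    # Imported lazily so the pure helper above works without Transformers installed.
    from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)
    return check_vocab_alignment(len(processor.tokenizer), model.config.vocab_size)
```

Calling `inspect_checkpoint()` should return `"aligned"` if the published configuration matches the tokenizer; any other result would suggest a mismatch worth investigating before further fine-tuning.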
Installation Instructions for Users
To install the necessary dependencies, run the following command:
!pip install transformers datasets
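Before running the code in the next section, it can help to confirm that the environment has everything it needs. The sketch below uses only the standard library. The package list is an assumption: `torch` and `sentencepiece` do not appear in the install command above, but they are commonly required when using SpeechT5 with PyTorch. `missing_packages` is a hypothetical helper name.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed requirement list (torch and sentencepiece are not in the install command).
REQUIRED = ["torch", "transformers", "datasets", "sentencepiece"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are importable.")
```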
Model Loading and Speech Generation Code
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor


def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """
    Load the SpeechT5 model, processor, and vocoder for text-to-speech.

    Args:
        checkpoint (str): The model checkpoint for SpeechT5 TTS.
        vocoder_checkpoint (str): The checkpoint for the HiFi-GAN vocoder.

    Returns:
        processor: The processor for the model.
        model: The loaded SpeechT5 model.
        vocoder: The loaded HiFi-GAN vocoder.
        device: The device (CPU or GPU) the model is loaded on.
    """
    # Check for GPU availability and set device accordingly
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the SpeechT5 processor and model
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)  # Move model to the correct device

    # Load the HiFi-GAN vocoder
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)  # Move vocoder to the correct device

    return processor, model, vocoder, device

# Example usage
processor, model, vocoder, device = load_speech_model()
# Verify the device being used
print(f"Model and vocoder loaded on device: {device}")
from datasets import load_dataset
# Load speaker embeddings (this dataset contains speaker-specific embeddings)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
from IPython.display import Audio, display
def generate_speech_from_text(text,
                              speaker_embedding=speaker_embedding,
                              processor=processor,
                              model=model,
                              vocoder=vocoder):
    """
    Generate speech from text with SpeechT5 and the HiFi-GAN vocoder, then play it in the notebook.

    Args:
        text (str): The input text to be converted to speech.
        speaker_embedding (torch.Tensor): The speaker embedding tensor.
        processor (SpeechT5Processor): The processor for the model.
        model (SpeechT5ForTextToSpeech): The loaded SpeechT5 model.
        vocoder (SpeechT5HifiGan): The loaded HiFi-GAN vocoder.

    Returns:
        None
    """
    # Parameters for text-to-speech generation
    max_text_positions = model.config.max_text_positions  # Token limit
    max_length = model.config.max_length * 1.2  # Slightly extended max_length
    min_length = max_length // 3  # Adjust based on max_length
    num_beams = 7  # Use beam search for better quality
    temperature = 0.6  # Reduce temperature for stability

    # Tokenize the input text and move input tensors to the correct device
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}  # Move inputs to device

    # Generate speech
    # NOTE: SpeechT5ForTextToSpeech.generate() may ignore generic text-generation
    # arguments such as num_beams and temperature; its documented length controls
    # are threshold, minlenratio, and maxlenratio.
    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),  # Ensure speaker_embedding is also on the correct device
        vocoder=vocoder,
        max_length=int(max_length),
        min_length=int(min_length),
        num_beams=num_beams,
        temperature=temperature,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
        eos_token_id=None,
        use_cache=True
    )

    # Move the generated waveform to the CPU and convert it to a NumPy array
    speech = speech.detach().cpu().numpy()

    # Play the generated speech using IPython Audio
    display(Audio(speech, rate=16000))

# Example usage
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
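The function above plays the audio inside a notebook and returns `None`, so nothing is kept on disk. Outside a notebook, one option is to write the waveform to a WAV file. The sketch below uses only the standard library and assumes mono float samples in the range [-1.0, 1.0] at the 16 kHz rate used in the playback call. `save_wav` is a hypothetical helper, and using it would require changing `generate_speech_from_text` to return the `speech` array.

```python
import struct
import wave


def save_wav(samples, path, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2                 # number of samples written
```

With a returned waveform, a call such as `save_wav(speech, "output.wav")` would produce a file that any standard audio player can open. The per-sample loop favors clarity over speed; a vectorized NumPy conversion would be faster for long clips.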
Intended uses & limitations
Intended uses:
- Text-to-Speech Generation: This model can be used to convert Wolof text into natural-sounding speech. It can be integrated into applications that require voice interfaces, virtual assistants, or voice synthesis for Wolof-speaking communities.
Limitations:
- Limited Scope: The model has been specifically fine-tuned for Wolof and may not perform well with other languages or accents.
- Data Availability: Speech quality depends on the dataset used for fine-tuning and may vary with the complexity of the input text.
- Vocabulary and Tokenizer Constraints: The tokenizer was specially trained for Wolof, so it may not handle out-of-vocabulary words or unknown characters effectively.
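To reduce the impact of the tokenizer limitation, input text can be screened for unsupported characters before synthesis. The following is a minimal sketch that assumes a character-level vocabulary. `find_unsupported_characters` is a hypothetical helper and the vocabulary shown is a toy example, not the model's real one.

```python
def find_unsupported_characters(text, vocabulary):
    """Return characters of `text` absent from `vocabulary`, in first-seen order."""
    unsupported = []
    for char in text:
        if char.isspace():
            continue  # whitespace is assumed to be handled by the tokenizer
        if char not in vocabulary and char not in unsupported:
            unsupported.append(char)
    return unsupported


# Toy vocabulary for illustration only; with the real model one could build it
# from the keys of processor.tokenizer.get_vocab().
toy_vocabulary = set("abcdefghijklmnopqrstuvwxyz")

print(find_unsupported_characters("nattukaay 123", toy_vocabulary))
```

Characters reported by such a check could be spelled out or removed before the text is passed to the model, for example by writing digits as Wolof words.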
Training and evaluation data
The model was fine-tuned on a Wolof dataset (the model card metadata lists galsenai/wolof_tts). This dataset was used to adapt the model to generate speech that reflects the phonetic and syntactic properties of Wolof.
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- Learning Rate: 1e-05
- Training Batch Size: 8
- Evaluation Batch Size: 2
- Seed: 42
- Gradient Accumulation Steps: 8
- Total Train Batch Size: 64
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- Learning Rate Scheduler Type: Linear
- Warmup Steps: 500
- Training Steps: 255000
- Mixed Precision Training: Native AMP
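Two of the values above follow from the others. The sketch below shows how the total train batch size can be derived and how a linear schedule with warmup typically evolves over the 255000 training steps. It assumes a single device and the common warmup-then-linear-decay formula; the function names are illustrative and the trainer's exact implementation may differ.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples contributing to one optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=255_000):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size(8, 8))  # 64, matching the total train batch size
for step in (0, 250, 500, 127_750, 255_000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate rises from zero to 1e-05 during the first 500 steps, then decays linearly to zero at the final step.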
Training results
Epoch | Training Loss | Validation Loss |
---|---|---|
26 | 0.3894 | 0.3687 |
27 | 0.3858 | 0.3712 |
28 | 0.3874 | 0.3669 |
29 | 0.3887 | 0.3685 |
30 | 0.3854 | 0.3670 |
32 | 0.3856 | 0.3697 |
The table lists only the final epochs of training.
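As a small illustration of how these numbers might be read, the sketch below copies the rows from the table and finds the epoch with the lowest validation loss, along with the spread across the listed epochs. It is only a reading aid and does not reflect how the published checkpoint was selected.

```python
# (epoch, training loss, validation loss) rows copied from the table above
results = [
    (26, 0.3894, 0.3687),
    (27, 0.3858, 0.3712),
    (28, 0.3874, 0.3669),
    (29, 0.3887, 0.3685),
    (30, 0.3854, 0.3670),
    (32, 0.3856, 0.3697),
]

best_epoch, _, best_val = min(results, key=lambda row: row[2])
val_losses = [row[2] for row in results]
spread = max(val_losses) - min(val_losses)

print(f"Lowest validation loss: {best_val:.4f} at epoch {best_epoch}")
print(f"Validation loss spread across listed epochs: {spread:.4f}")
```

The small spread suggests that validation loss had largely plateaued by these epochs.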
Framework version
- Transformers: 4.41.2
- PyTorch: 2.4.0+cu121
- Datasets: 3.2.0
- Tokenizers: 0.19.1
Author
- Bilal FAYE