---
license: mit
base_model: microsoft/speecht5_tts
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof
  results: []
datasets:
- galsenai/wolof_tts
language:
- wo
pipeline_tag: text-to-speech
---

# speecht5_tts-wolof

This model is a fine-tuned version of [SpeechT5](https://huggingface.co./microsoft/speecht5_tts) for Text-to-Speech (TTS) on a Wolof dataset. It uses a custom tokenizer designed for Wolof and adjusts the baseline model's configuration to account for the new vocabulary the tokenizer introduces. The result is speech synthesis specifically tuned for the Wolof language.

## Model description

This model is based on the `SpeechT5` architecture, which integrates speech recognition and synthesis into a unified framework. It is fine-tuned for Text-to-Speech (TTS) using a custom-trained tokenizer and an adapted configuration that accounts for the unique vocabulary of the Wolof language. Fine-tuning was carried out on a Wolof dataset so that the synthesized speech captures the nuances of the language.

---

### Installation

To install the necessary dependencies, run:

```bash
pip install transformers datasets
```

### Model loading and speech generation

```python
import torch
from datasets import load_dataset
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor
from IPython.display import Audio, display


def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof",
                      vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """
    Load the SpeechT5 model, processor, and vocoder for text-to-speech.

    Args:
        checkpoint (str): The model checkpoint for SpeechT5 TTS.
        vocoder_checkpoint (str): The checkpoint for the HiFi-GAN vocoder.

    Returns:
        processor: The processor for the model.
        model: The loaded SpeechT5 model.
        vocoder: The loaded HiFi-GAN vocoder.
        device: The device (CPU or GPU) the model is loaded on.
    """
    # Check for GPU availability and set the device accordingly
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the SpeechT5 processor and model
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)

    # Load the HiFi-GAN vocoder
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device


# Example usage
processor, model, vocoder, device = load_speech_model()
print(f"Model and vocoder loaded on device: {device}")

# Load speaker embeddings (this dataset contains speaker-specific x-vectors)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def generate_speech_from_text(text, speaker_embedding=speaker_embedding,
                              processor=processor, model=model, vocoder=vocoder):
    """
    Generate speech from text using SpeechT5 and the HiFi-GAN vocoder.

    Args:
        text (str): The input text to be converted to speech.
        speaker_embedding (torch.Tensor): The speaker embedding tensor.
        processor (SpeechT5Processor): The processor for the model.
        model (SpeechT5ForTextToSpeech): The loaded SpeechT5 model.
        vocoder (SpeechT5HifiGan): The loaded HiFi-GAN vocoder.

    Returns:
        None
    """
    # Parameters for text-to-speech generation
    max_text_positions = model.config.max_text_positions  # Encoder token limit
    max_length = model.config.max_length * 1.2            # Slightly extended max_length
    min_length = max_length // 3                          # Scaled from max_length
    num_beams = 7                                         # Beam search for better quality
    temperature = 0.6                                     # Lower temperature for stability

    # Tokenize the input text and move the tensors to the model's device
    inputs = processor(text=text, return_tensors="pt", padding=True,
                       truncation=True, max_length=max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    # Generate speech (the speaker embedding must be on the same device)
    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        max_length=int(max_length),
        min_length=int(min_length),
        num_beams=num_beams,
        temperature=temperature,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
        eos_token_id=None,
        use_cache=True
    )

    # Detach the waveform from the computation graph and move it to the CPU
    speech = speech.detach().cpu().numpy()

    # Play the generated speech (16 kHz output)
    display(Audio(speech, rate=16000))


# Example usage
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
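The example above plays the audio inline in a notebook. To keep the result on disk instead, one option is to write the waveform to a WAV file. Below is a minimal sketch, assuming the `soundfile` package is installed (`pip install soundfile`) and reusing the `processor`, `model`, `vocoder`, and `speaker_embedding` objects created above; the output filename is just an example.

```python
import soundfile as sf

# Tokenize and generate as above, but save the waveform instead of playing it.
inputs = processor(text="ñu ne ñoom ñooy nattukaay satélite yi", return_tensors="pt")
speech = model.generate(
    inputs["input_ids"].to(model.device),
    speaker_embeddings=speaker_embedding.to(model.device),
    vocoder=vocoder,
)

# The HiFi-GAN vocoder produces 16 kHz mono audio.
sf.write("wolof_tts_output.wav", speech.detach().cpu().numpy(), samplerate=16000)
```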
## Intended uses & limitations

### Intended uses

- **Text-to-Speech generation**: Converting Wolof text into natural-sounding speech. The model can be integrated into applications that need voice interfaces, virtual assistants, or speech synthesis for Wolof-speaking communities.

### Limitations

- **Limited scope**: The model has been fine-tuned specifically for Wolof and may not perform well on other languages or accents.
- **Data availability**: The quality of the generated speech may vary with the complexity of the input text and the coverage of the training data.
- **Vocabulary and tokenizer constraints**: The tokenizer was trained specifically for Wolof, so it may not handle out-of-vocabulary words or unknown characters effectively.

## Training and evaluation data

The model was fine-tuned on a Wolof dataset ([galsenai/wolof_tts](https://huggingface.co./datasets/galsenai/wolof_tts)), used to adapt the model to generate speech that accurately reflects the phonetic and syntactic properties of Wolof.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- **Learning Rate**: 1e-05
- **Training Batch Size**: 8
- **Evaluation Batch Size**: 2
- **Seed**: 42
- **Gradient Accumulation Steps**: 8
- **Total Train Batch Size**: 64 (8 per device × 8 accumulation steps)
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **Learning Rate Scheduler Type**: Linear
- **Warmup Steps**: 500
- **Training Steps**: 255000
- **Mixed Precision Training**: Native AMP

A sketch of how these values map onto a `Seq2SeqTrainingArguments` configuration is given in the appendix at the end of this card.

### Training results

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 26    | 0.3894        | 0.3687          |
| 27    | 0.3858        | 0.3712          |
| 28    | 0.3874        | 0.3669          |
| 29    | 0.3887        | 0.3685          |
| 30    | 0.3854        | 0.3670          |
| 32    | 0.3856        | 0.3697          |

Only the final epochs of training are shown.

### Framework versions

- **Transformers**: 4.41.2
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.19.1

# Author

- **Bilal FAYE**
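---

### Appendix: training configuration sketch

For readers who want to reproduce a similar run, the hyperparameters listed above map roughly onto a `Seq2SeqTrainingArguments` configuration as follows. This is a reconstruction, not the original training script: every value comes from the list above, while `output_dir` and anything not listed there are illustrative placeholders.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction from the hyperparameters listed in this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-wolof",   # placeholder, not from the card
    per_device_train_batch_size=8,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,     # 8 * 8 = 64 effective train batch size
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=255000,
    seed=42,
    fp16=True,                         # native AMP mixed precision
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the default optimizer.
)
```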