bilalfaye committed on
Commit 7e6c1ac · verified · 1 Parent(s): 14e8689

Update README.md

Files changed (1): README.md (+165 -62)
README.md CHANGED

model-index:
- name: speecht5_tts-wolof
  results: []
datasets:
- galsenai/wolof_tts
language:
- wo
pipeline_tag: text-to-speech
---
# speecht5_tts-wolof

This model is a fine-tuned version of [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) for text-to-speech (TTS) in Wolof. It uses a custom tokenizer designed for Wolof, and the base model's configuration is adjusted to account for the vocabulary this tokenizer introduces, giving a version of SpeechT5 tuned specifically for Wolof speech synthesis.
## Model description

This model is based on the SpeechT5 architecture, which unifies speech recognition and synthesis in a single encoder-decoder framework. It is fine-tuned for TTS with a custom-trained tokenizer and an adapted configuration that reflects the vocabulary of Wolof, using a dataset of Wolof text paired with recorded speech so that the synthesized output captures the nuances of the language.

---

### Installation

To install the necessary dependencies, run:

```bash
pip install transformers datasets torch
```

### Model loading and speech generation

The example below loads the model and vocoder, then synthesizes Wolof speech. Audio playback uses IPython's `Audio` widget, so it assumes a notebook environment.
```python
import torch
from datasets import load_dataset
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor
from IPython.display import Audio, display


def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof",
                      vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """
    Load the SpeechT5 model, processor, and vocoder for text-to-speech.

    Args:
        checkpoint (str): The model checkpoint for SpeechT5 TTS.
        vocoder_checkpoint (str): The checkpoint for the HiFi-GAN vocoder.

    Returns:
        processor: The processor for the model.
        model: The loaded SpeechT5 model.
        vocoder: The loaded HiFi-GAN vocoder.
        device: The device (CPU or GPU) the model is loaded on.
    """
    # Use a GPU when available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the SpeechT5 processor and model, moving the model to the device
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)

    # Load the HiFi-GAN vocoder on the same device
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device


processor, model, vocoder, device = load_speech_model()
print(f"Model and vocoder loaded on device: {device}")

# Load a speaker embedding (x-vector) derived from the CMU Arctic dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def generate_speech_from_text(text,
                              speaker_embedding=speaker_embedding,
                              processor=processor,
                              model=model,
                              vocoder=vocoder):
    """
    Generate speech from text using SpeechT5 and the HiFi-GAN vocoder.

    Args:
        text (str): The input text to be converted to speech.
        speaker_embedding (torch.Tensor): The speaker embedding tensor.
        processor (SpeechT5Processor): The processor for the model.
        model (SpeechT5ForTextToSpeech): The loaded SpeechT5 model.
        vocoder (SpeechT5HifiGan): The loaded HiFi-GAN vocoder.

    Returns:
        None
    """
    # Tokenize the input text, truncating to the model's text position limit
    inputs = processor(text=text, return_tensors="pt", padding=True,
                       truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    # Generate the waveform. SpeechT5 produces the spectrogram autoregressively
    # and stops via a predicted stop token, so text-decoding options such as
    # num_beams or temperature do not apply here.
    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
    )

    # Move the waveform to the CPU and play it (requires a notebook environment)
    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))


# Example usage
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
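Outside a notebook, the generated waveform can be written to disk instead of played inline. A minimal sketch, assuming `generate_speech_from_text` is modified to `return speech` and that the optional `soundfile` package is installed:

```python
import soundfile as sf

# Assumes generate_speech_from_text was changed to return the NumPy waveform
speech = generate_speech_from_text("ñu ne ñoom ñooy nattukaay satélite yi")
sf.write("wolof_tts_output.wav", speech, samplerate=16000)  # the model outputs 16 kHz audio
```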
## Intended uses & limitations

### Intended uses

- **Text-to-speech generation**: Converting Wolof text into natural-sounding speech, for integration into applications that need voice interfaces, virtual assistants, or speech synthesis for Wolof-speaking communities.

### Limitations

- **Limited scope**: The model is fine-tuned specifically for Wolof and is unlikely to perform well on other languages or accents.
- **Data coverage**: Speech quality may vary with the complexity of the input text and how well it is represented in the training data.
- **Vocabulary and tokenizer constraints**: The tokenizer was trained specifically for Wolof, so it may not handle out-of-vocabulary words or unknown characters well; the sketch below shows how to check.
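A quick way to see whether a given string falls outside the tokenizer's vocabulary is to count unknown tokens in its encoding. A minimal sketch, reusing the `processor` loaded above:

```python
# Count tokens that map to the tokenizer's unknown-token id
ids = processor.tokenizer("ñu ne ñoom")["input_ids"]
unk_id = processor.tokenizer.unk_token_id
n_unk = sum(1 for i in ids if i == unk_id)
print(f"{n_unk} of {len(ids)} tokens are unknown")
```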
## Training and evaluation data

The model was fine-tuned on the [galsenai/wolof_tts](https://huggingface.co/datasets/galsenai/wolof_tts) dataset, which pairs Wolof text with recorded speech. This data was used to adapt the model so that the generated speech reflects the phonetic and syntactic properties of Wolof. A loading sketch follows.
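A minimal sketch of loading this dataset for inspection (the split name and column layout are assumptions; check the dataset card):

```python
from datasets import load_dataset

# Split and column names are assumptions, not confirmed by this card
wolof_tts = load_dataset("galsenai/wolof_tts", split="train")
print(wolof_tts)     # lists the available columns
print(wolof_tts[0])  # one example, typically a transcription plus an audio field
```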
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list):

- **Learning rate**: 1e-05
- **Train batch size**: 8
- **Eval batch size**: 2
- **Seed**: 42
- **Gradient accumulation steps**: 8
- **Total train batch size**: 64
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **LR scheduler type**: linear
- **Warmup steps**: 500
- **Training steps**: 255000
- **Mixed precision training**: Native AMP
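For reference, these settings map directly onto `transformers` `Seq2SeqTrainingArguments`. A minimal sketch (the output directory and evaluation cadence are illustrative assumptions, not taken from the original run):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-wolof",  # assumption: any local directory works
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,    # 8 x 8 gives the effective batch size of 64
    seed=42,
    warmup_steps=500,
    max_steps=255000,
    lr_scheduler_type="linear",
    fp16=True,                        # native AMP mixed precision
    eval_strategy="epoch",            # assumption: per-epoch evaluation, as in the results table
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-8 is the optimizer default
```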
### Training results

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 26    | 0.3894        | 0.3687          |
| 27    | 0.3858        | 0.3712          |
| 28    | 0.3874        | 0.3669          |
| 29    | 0.3887        | 0.3685          |
| 30    | 0.3854        | 0.3670          |
| 32    | 0.3856        | 0.3697          |

Only the final epochs are shown; validation loss decreased steadily over training, from 0.4592 after the first epoch to 0.3697 at epoch 32.
+
### Framework versions

- **Transformers**: 4.41.2
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.19.1

# Author

- **Bilal FAYE**