---
model-index:
- name: speecht5_tts-wolof
  results: []
datasets:
- galsenai/wolof_tts
language:
- wo
pipeline_tag: text-to-speech
---

# speecht5_tts-wolof

This model is a fine-tuned version of [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) for text-to-speech (TTS) in Wolof. It uses a custom tokenizer designed for Wolof, and the baseline model's configuration is adjusted to account for the new vocabulary that tokenizer introduces. The model reaches a validation loss of 0.3697 and provides speech synthesis tuned specifically for the Wolof language.

## Model description

The model is based on the `SpeechT5` architecture, which integrates speech recognition and synthesis in a unified framework. It was fine-tuned for TTS with a custom-trained tokenizer and an adapted configuration that reflects the vocabulary of the Wolof language, using a Wolof dataset so that the synthesized speech captures the nuances of the language.
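The adaptation code itself is not part of this card; the following is a minimal sketch, assuming a hypothetical local tokenizer path, of how a SpeechT5 checkpoint is typically resized to a new vocabulary before fine-tuning:

```python
from transformers import SpeechT5ForTextToSpeech, SpeechT5Tokenizer

# Hypothetical path to a tokenizer trained on Wolof text (placeholder).
tokenizer = SpeechT5Tokenizer.from_pretrained("path/to/wolof-tokenizer")

# Start from the base checkpoint.
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Resize the text-embedding matrix to the new vocabulary; this also updates
# model.config.vocab_size so the configuration stays consistent.
model.resize_token_embeddings(len(tokenizer))
```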

---

### Installation

Install the required dependencies:

```bash
pip install torch transformers datasets
```

### Model Loading and Speech Generation Code

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display


def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof",
                      vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """
    Load the SpeechT5 model, processor, and vocoder for text-to-speech.

    Args:
        checkpoint (str): The model checkpoint for SpeechT5 TTS.
        vocoder_checkpoint (str): The checkpoint for the HiFi-GAN vocoder.

    Returns:
        processor: The processor for the model.
        model: The loaded SpeechT5 model.
        vocoder: The loaded HiFi-GAN vocoder.
        device: The device (CPU or GPU) the model is loaded on.
    """
    # Check for GPU availability and set the device accordingly
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the SpeechT5 processor and model, moving the model to the device
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)

    # Load the HiFi-GAN vocoder on the same device
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device


processor, model, vocoder, device = load_speech_model()
print(f"Model and vocoder loaded on device: {device}")

# Load speaker embeddings (x-vectors) and pick one speaker
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def generate_speech_from_text(text,
                              speaker_embedding=speaker_embedding,
                              processor=processor,
                              model=model,
                              vocoder=vocoder):
    """
    Generate speech from text using SpeechT5 and the HiFi-GAN vocoder.

    Args:
        text (str): The input text to be converted to speech.
        speaker_embedding (torch.Tensor): The speaker embedding tensor.
        processor (SpeechT5Processor): The processor for the model.
        model (SpeechT5ForTextToSpeech): The loaded SpeechT5 model.
        vocoder (SpeechT5HifiGan): The loaded HiFi-GAN vocoder.

    Returns:
        None
    """
    # Parameters for text-to-speech generation
    max_text_positions = model.config.max_text_positions  # input token limit
    max_length = int(model.config.max_length * 1.2)       # slightly extended max_length
    min_length = max_length // 3
    num_beams = 7                                         # beam search for better quality
    temperature = 0.6                                     # lower temperature for stability

    # Tokenize the input text and move the tensors to the model's device
    inputs = processor(text=text, return_tensors="pt", padding=True,
                       truncation=True, max_length=max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    # Generate the speech waveform
    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        max_length=max_length,
        min_length=min_length,
        num_beams=num_beams,
        temperature=temperature,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
        eos_token_id=None,
        use_cache=True,
    )

    # Detach from the computation graph and move to CPU for playback
    speech = speech.detach().cpu().numpy()

    # Play the generated speech (SpeechT5 outputs 16 kHz audio)
    display(Audio(speech, rate=16000))


# Example usage
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```

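The example above plays the audio inline, which assumes a notebook environment. Outside a notebook, the waveform can be written to disk instead; here is a minimal sketch reusing the objects loaded above (`soundfile` is an additional dependency, not listed in the install step):

```python
import soundfile as sf

# Generate a waveform directly and save it as a WAV file (sketch; reuses
# `processor`, `model`, `vocoder`, and `speaker_embedding` from above).
inputs = processor(text="ñu ne ñoom ñooy nattukaay satélite yi", return_tensors="pt")
speech = model.generate(
    inputs["input_ids"].to(model.device),
    speaker_embeddings=speaker_embedding.to(model.device),
    vocoder=vocoder,
)
sf.write("wolof_tts_output.wav", speech.detach().cpu().numpy(), samplerate=16000)
```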

## Intended uses & limitations

### Intended uses

- **Text-to-speech generation**: converting Wolof text into natural-sounding speech, for integration into voice interfaces, virtual assistants, or other speech applications for Wolof-speaking communities.

### Limitations

- **Limited scope**: the model is fine-tuned specifically for Wolof and may not perform well on other languages or accents.
- **Data availability**: the quality of the generated speech can vary with the complexity of the input text and the coverage of the training data.
- **Vocabulary and tokenizer constraints**: the tokenizer was trained specifically for Wolof, so it may not handle out-of-vocabulary words or unknown characters well (a quick check is sketched below).
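One way to test the tokenizer limitation before synthesis is to look for unknown-token IDs in the encoded input; a small sketch using the processor loaded earlier:

```python
# Tokens mapped to the unknown token will be degraded or dropped in the
# synthesized audio, so flag them before generating speech.
token_ids = processor.tokenizer("ñu ne ñoom")["input_ids"]
if processor.tokenizer.unk_token_id in token_ids:
    print("Warning: input contains characters outside the tokenizer's vocabulary.")
```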

## Training and evaluation data

The model was fine-tuned on [galsenai/wolof_tts](https://huggingface.co/datasets/galsenai/wolof_tts), a corpus of Wolof text paired with recorded speech, so that the generated audio reflects the phonetic and syntactic properties of the language.
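For reference, the dataset named in this card's metadata can be loaded directly from the Hub; the split name here is an assumption, so check the dataset card for the exact schema:

```python
from datasets import load_dataset

# Inspect the Wolof TTS corpus referenced in the card's metadata.
wolof_tts = load_dataset("galsenai/wolof_tts", split="train")  # split name assumed
print(wolof_tts)     # column names and number of rows
print(wolof_tts[0])  # a single example
```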

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- **Learning rate**: 1e-05
- **Training batch size**: 8
- **Evaluation batch size**: 2
- **Seed**: 42
- **Gradient accumulation steps**: 8
- **Total train batch size**: 64
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **Learning rate scheduler type**: linear
- **Warmup steps**: 500
- **Training steps**: 255000
- **Mixed precision training**: Native AMP
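Expressed as Hugging Face training arguments, these settings would look roughly as follows (a sketch, not the author's exact training script; the output directory is a placeholder). Note that the per-device batch size of 8 with 8 gradient-accumulation steps yields the effective train batch size of 64:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-wolof",  # placeholder output path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,    # 8 x 8 = effective batch size of 64
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=255000,
    fp16=True,                        # native AMP mixed-precision training
)
```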

### Training results

| Training Loss | Epoch   | Step  | Validation Loss |
|---------------|---------|-------|-----------------|
| 0.3945        | 20.9998 | 11815 | 0.3713          |
| 0.3929        | 21.9987 | 12377 | 0.3720          |
| 0.3919        | 22.9993 | 12940 | 0.3736          |
| 0.3907        | 24.0    | 13503 | 0.3702          |
| 0.3893        | 24.9989 | 14065 | 0.3700          |
| 0.3894        | 25.9996 | 14628 | 0.3707          |
| 0.39          | 26.9984 | 15190 | 0.3687          |
| 0.3858        | 27.9991 | 15753 | 0.3712          |
| 0.3874        | 28.9998 | 16316 | 0.3669          |
| 0.3887        | 29.9987 | 16878 | 0.3685          |
| 0.3854        | 30.9993 | 17441 | 0.3670          |
| 0.3856        | 32.0    | 18004 | 0.3697          |

Earlier epochs are omitted; the final validation loss is 0.3697.

### Framework versions

- **Transformers**: 4.41.2
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.19.1

# Author

- **Bilal FAYE**