![image](Utility/toucan.png)

IMS Toucan is a toolkit for teaching, training and using state-of-the-art speech synthesis models, developed at the
**Institute for Natural Language Processing (IMS), University of Stuttgart, Germany**. Everything is based on pure
Python and PyTorch to keep it as simple and beginner-friendly, yet as powerful, as possible.

The PyTorch modules of [Tacotron 2](https://arxiv.org/abs/1712.05884)
and [FastSpeech 2](https://arxiv.org/abs/2006.04558) are taken from
[ESPnet](https://github.com/espnet/espnet). The PyTorch modules of [HiFiGAN](https://arxiv.org/abs/2010.05646) are taken
from the [ParallelWaveGAN repository](https://github.com/kan-bayashi/ParallelWaveGAN),
which is also authored by the brilliant [Tomoki Hayashi](https://github.com/kan-bayashi).

For a version of the toolkit that includes TransformerTTS instead of Tacotron 2 and MelGAN instead of HiFiGAN, check out
the TransformerTTS and MelGAN branch. They are separated to keep the code clean, simple and minimal.

---

## Contents

- [New Features](#new-features)
- [Demonstration](#demonstration)
- [Installation](#installation)
    + [Basic Requirements](#basic-requirements)
    + [Speaker Embedding](#speaker-embedding)
    + [espeak-ng](#espeak-ng)
- [Creating a new Pipeline](#creating-a-new-pipeline)
    * [Build a HiFi-GAN Pipeline](#build-a-hifi-gan-pipeline)
    * [Build a FastSpeech 2 Pipeline](#build-a-fastspeech-2-pipeline)
- [Training a Model](#training-a-model)
- [Creating a new InferenceInterface](#creating-a-new-inferenceinterface)
- [Using a trained Model for Inference](#using-a-trained-model-for-inference)
- [FAQ](#faq)
- [Citation](#citation)

---

## New Features

- [As shown in this paper](http://festvox.org/blizzard/bc2021/BC21_DelightfulTTS.pdf), vocoders can be used to perform
  super-resolution and spectrogram inversion simultaneously. We added this to our HiFi-GAN vocoder. It now takes 16 kHz
  spectrograms as input, but produces 48 kHz waveforms.
- We officially introduced IMS Toucan in
  [our contribution to the Blizzard Challenge 2021](http://festvox.org/blizzard/bc2021/BC21_IMS.pdf). Check out the
  bottom of the readme for a bibtex entry.
- We now use articulatory representations of phonemes as the input for all models. This allows us to easily use
  multilingual data.
- We provide a checkpoint trained with [model agnostic meta learning](https://arxiv.org/abs/1703.03400) from which you
  should be able to fine-tune a model with very little data in almost any language.
- We now use a small self-contained Aligner that is trained with CTC, inspired by
  [this implementation](https://github.com/as-ideas/DeepForcedAligner). This allows us to get rid of the dependence on
  autoregressive models. As a consequence, Tacotron 2 is no longer part of this branch. Like TransformerTTS, it is
  still available in other branches.

---

## Demonstration

[Here are two sentences](https://drive.google.com/file/d/1ltAyR2EwAbmDo2hgkx1mvUny4FuxYmru/view?usp=sharing)
produced by Tacotron 2 combined with HiFi-GAN, trained on
[Nancy Krebs](https://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/) using this toolkit.

[Here is some speech](https://drive.google.com/file/d/1mZ1LvTlY6pJ5ZQ4UXZ9jbzB651mufBrB/view?usp=sharing)
produced by FastSpeech 2 and MelGAN trained on [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
using this toolkit.

And [here is a sentence](https://drive.google.com/file/d/1FT49Jf0yyibwMDbsEJEO9mjwHkHRIGXc/view?usp=sharing)
produced by TransformerTTS and MelGAN trained on [Thorsten](https://github.com/thorstenMueller/deep-learning-german-tts)
using this toolkit.

[Here is some speech](https://drive.google.com/file/d/14nPo2o1VKtWLPGF7e_0TxL8XGI3n7tAs/view?usp=sharing)
produced by a multi-speaker FastSpeech 2 with MelGAN trained on
[LibriTTS](https://research.google/tools/datasets/libri-tts/) using this toolkit. Fans of the videogame Portal may
recognize who was used as the reference speaker for this utterance.

[Interactive Demo of our entry to the Blizzard Challenge 2021.](https://colab.research.google.com/drive/1bRaySf8U55MRPaxqBr8huWrzCOzlxVqw)
Note that this demo is based on an older version of the toolkit. It uses FastSpeech 2 with MelGAN as vocoder and is
trained on 5 hours of Spanish.

---

## Installation

#### Basic Requirements

To install this toolkit, clone it onto the machine you want to use it on. That machine should have at least one GPU if
you intend to train models on it. For inference, you can get by without a GPU.
Navigate to the directory you have cloned. We are going to create and activate a
[conda virtual environment](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
to install the basic requirements into. After creating the environment, the command you need to use to activate the
virtual environment is displayed. The commands below show everything you need to do.

```
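# create the environment inside the cloned directory
conda create --prefix ./toucan_conda_venv --no-default-packages python=3.8

# activate it (conda displays this command after creating the environment)
conda activate ./toucan_conda_venv

pip install --no-cache-dir -r requirements.txt

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html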
```

#### Speaker Embedding

As [NVIDIA has shown](https://arxiv.org/pdf/2110.05798.pdf), you get better results by fine-tuning a pretrained model on
a new speaker than by training a multispeaker model. We have thus dropped support for zero-shot multispeaker models
using speaker embeddings. However, we still
use [Speechbrain's ECAPA-TDNN](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) for a cycle consistency loss to
make adapting to new speakers a bit faster.

In the current version of the toolkit, no further action should be required. Note, however, that an internet connection
is needed the first time you use the multispeaker functionality, because the pretrained models have to be downloaded.

#### espeak-ng

Finally, you need to have espeak-ng installed on your system, because it is used as the backend for the phonemizer. If
you replace the phonemizer, you don't need it. On most Linux environments it will be installed already. If it is not,
and you have sufficient rights, you can install it by simply running

```
apt-get install espeak-ng
```

---

## Creating a new Pipeline

To create a new pipeline to train a HiFiGAN vocoder, you only need a set of audio files. To create a new pipeline for
FastSpeech 2, you need audio files, corresponding text labels, and an already trained Aligner model to estimate the
duration information that FastSpeech 2 needs as input. Let's go through them in order of increasing complexity.

### Build a HiFi-GAN Pipeline

In the directory called
*Utility* there is a file called
*file_lists.py*. In this file you should write a function that returns a list of all the absolute paths to each of the
audio files in your dataset as strings.
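
A minimal sketch of such a function, assuming the dataset is a directory tree of `.wav` files. The function name
`get_file_list_my_dataset` and the default dataset location are hypothetical placeholders, not names that exist in the
toolkit.

```python
from pathlib import Path


def get_file_list_my_dataset(root="/path/to/my_dataset"):
    """Return the absolute paths of all .wav files below `root` as strings.

    Hypothetical example: the function name and the dataset location are
    placeholders, adapt them to your own data.
    """
    root_dir = Path(root).resolve()
    # rglob searches recursively; sorting keeps the order reproducible
    return sorted(str(path) for path in root_dir.rglob("*.wav"))
```

If your audio comes in a different format or is spread over several directories, only the glob pattern and the root
need to change.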

Then go to the directory
*TrainingInterfaces/TrainingPipelines*. In there, make a copy of any existing pipeline that has HiFiGAN in its name. We
will use this as reference and only make the necessary changes to use the new dataset. Import the function you have just
written as
*get_file_list*. Now look out for a variable called
*model_save_dir*. This is the default directory that checkpoints will be saved into, unless you specify another one when
calling the training script. Change it to whatever you like.

Now you need to add your newly created pipeline to the pipeline dictionary in the file
*run_training_pipeline.py* in the top level of the toolkit. In this file, import the
*run* function from the pipeline you just created and give it a descriptive name. Now in the
*pipeline_dict*, add your imported function as the value and use a shorthand that makes sense as the key. And just like
that you're done.
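
The registration step can be pictured roughly as follows. This is only a sketch: the stand-in function, its
parameters and the shorthand `hifigan_my_dataset` are hypothetical, and in the real file the function would come from
an import of your pipeline's *run* function under an alias rather than being defined in place.

```python
def run_hifigan_my_dataset(gpu_id, resume_checkpoint, finetune, model_save_dir):
    """Stand-in for the `run` function of a copied pipeline (hypothetical signature)."""
    return f"training HiFiGAN on {gpu_id}, saving to {model_save_dir}"


pipeline_dict = {
    # shorthand (key) -> run function (value)
    "hifigan_my_dataset": run_hifigan_my_dataset,
}

# The shorthand given on the command line selects the function to call.
selected_run = pipeline_dict["hifigan_my_dataset"]
print(selected_run(gpu_id="cpu",
                   resume_checkpoint=None,
                   finetune=False,
                   model_save_dir="Models/HiFiGAN_my_dataset"))
```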

### Build a FastSpeech 2 Pipeline

In the directory called
*Utility* there is a file called
*path_to_transcript_dicts.py*. In this file you should write a function that returns a dictionary that has all the
absolute paths to each of the audio files in your dataset as strings as the keys and the textual transcriptions of the
corresponding audios as the values.
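
A minimal sketch of such a function under an assumed dataset layout: a file `metadata.csv` with one
`file_id|transcript` pair per line, and the audio files in a `wavs` subdirectory. Both the layout and the function name
are hypothetical, so adapt them to how your data is actually stored.

```python
import os


def build_path_to_transcript_dict_my_dataset(root="/path/to/my_dataset"):
    """Map absolute audio paths to their transcriptions.

    Assumed (hypothetical) layout: `root/metadata.csv` holds one
    `file_id|transcript` pair per line and the audio lives in `root/wavs/`.
    """
    root = os.path.abspath(root)
    path_to_transcript = dict()
    with open(os.path.join(root, "metadata.csv"), encoding="utf8") as metadata:
        for line in metadata:
            line = line.strip()
            if not line:
                continue  # skip empty lines
            # split only at the first separator, the text may contain "|" itself
            file_id, transcript = line.split("|", maxsplit=1)
            wav_path = os.path.join(root, "wavs", file_id + ".wav")
            path_to_transcript[wav_path] = transcript
    return path_to_transcript
```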

Then go to the directory
*TrainingInterfaces/TrainingPipelines*. In there, make a copy of any existing pipeline that has FastSpeech 2 in its
name. We will use this copy as reference and only make the necessary changes to use the new dataset. Import the function
you have just written as
*build_path_to_transcript_dict*. Since the data requires a considerable amount of preprocessing, a cache will be built
and saved as a file for quick and easy restarts. So find the variable
*cache_dir* and adapt it to your needs. The same goes for the variable
*save_dir*, which is where the checkpoints will be saved to. This is a default value. You can overwrite it when calling
the pipeline later using a command line argument, in case you want to fine-tune from a checkpoint and thus save into a
different directory.

In your new pipeline file, look out for the line in which the
*acoustic_model* is loaded. Change the path so that it points to the checkpoint of an Aligner model. It can either be
the one that is supplied with the toolkit in the download script, or one that you trained yourself. In the example
pipelines, the one that we provide is fine-tuned on the dataset it is applied to before it is used to extract durations.

Since we are using text here, we have to make sure that the text processing is adequate for the language. So check in
*Preprocessing/TextFrontend* whether the TextFrontend already has a language ID (e.g. 'en' and 'de') for the language of
your dataset. If not, you'll have to implement handling for that, but it should be pretty simple if you do it
analogously to what is there already. Now back in the pipeline, change the
*lang* argument in the creation of the dataset and in the call to the train loop function to the language ID that
matches your data.

Now navigate to the implementation of the
*train_loop* that is called in the pipeline. In this file, find the function called
*plot_progress_spec*. This function will produce spectrogram plots during training, which is the most important way to
monitor the progress of the training. In there, you may need to add an example sentence for the language of the data you
are using. It should all be pretty clear from looking at it.

Once this is done, we are almost finished. Now we just need to make the pipeline available to the
*run_training_pipeline.py* file in the top level. In said file, import the
*run* function from the pipeline you just created and give it a descriptive name. Now in the
*pipeline_dict*, add your imported function as the value and use a shorthand that makes sense as the key. And that's it.

---

## Training a Model

Once you have a pipeline built, training is easy. Just activate your virtual environment and run the command
below. You might want to use something like nohup to keep it running after you log out from the server (then you should
also add -u as an option to python) and add an & to start it in the background. You might also want to redirect stdout
and stderr into a file using >, but all of that is just standard shell use and has nothing to do with the toolkit.

```
python run_training_pipeline.py <shorthand of the pipeline>
```

You can supply any of the following arguments, but don't have to (although for training you should definitely specify at
least a GPU ID).

```
--gpu_id <ID of the GPU you wish to use, as displayed with nvidia-smi, default is cpu> 

--resume_checkpoint <path to a checkpoint to load>

--resume (if this is present, the furthest checkpoint available will be loaded automatically)

--finetune (if this is present, the provided checkpoint will be fine-tuned on the data from this pipeline)

--model_save_dir <path to a directory where the checkpoints should be saved>
```
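
To make the meaning of these options more concrete, here is a sketch of how they could be parsed with `argparse`.
It is an illustration that mirrors the list above, not a copy of the toolkit's own code. The helper name
`build_parser` and the shorthand `hifigan_my_dataset` are hypothetical, and the default of `None` for the two path
options is an assumption.

```python
import argparse


def build_parser(pipeline_names):
    """Build a parser for the options listed above (sketch, not the toolkit's code)."""
    parser = argparse.ArgumentParser(description="training pipeline launcher (sketch)")
    parser.add_argument("pipeline", choices=pipeline_names,
                        help="shorthand of the pipeline to run")
    parser.add_argument("--gpu_id", type=str, default="cpu",
                        help="GPU ID as displayed by nvidia-smi")
    parser.add_argument("--resume_checkpoint", type=str, default=None,
                        help="path to a checkpoint to load")
    parser.add_argument("--resume", action="store_true",
                        help="load the furthest available checkpoint automatically")
    parser.add_argument("--finetune", action="store_true",
                        help="fine-tune the provided checkpoint on this pipeline's data")
    parser.add_argument("--model_save_dir", type=str, default=None,
                        help="directory where checkpoints should be saved")
    return parser


# Parse an example command line instead of sys.argv to keep the sketch self-contained.
args = build_parser(["hifigan_my_dataset"]).parse_args(
    ["hifigan_my_dataset", "--gpu_id", "0", "--finetune"])
print(args.pipeline, args.gpu_id, args.finetune, args.resume)
```

The two flags `--resume` and `--finetune` carry no value: their presence alone switches the behaviour on.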

After every epoch, some logs will be written to the console. If the loss becomes NaN, you'll need to use a smaller
learning rate or more warmup steps in the arguments of the call to the training_loop in the pipeline you are running.

If you get CUDA out-of-memory errors, you need to decrease the batch size in the arguments of the call to the
training_loop in the pipeline you are running. Try decreasing the batch size in small steps until the out-of-memory
errors no longer occur. Decreasing the batch size may also require you to use a smaller learning rate. The use of
GroupNorm should make it so that the training remains mostly stable.

Speaking of plots: the model's checkpoint files and self-explanatory visualization data will appear in the directory
you specified for saving. Since the checkpoints are quite big, only the five most recent ones will be kept. Training
will stop after 500,000 steps for FastSpeech 2, and after 2,500,000 steps for HiFiGAN. Depending on the machine and
configuration you are using, this will take multiple days, so verify that everything works on small tests before running
the big thing. If you want to stop earlier, just kill the process. Since everything is daemonic, all the child processes
should die with it. In case there are some ghost processes left behind, you can use the following command to find them
and kill them manually.

```
fuser -v /dev/nvidia*
```

After training is complete, it is recommended to run
*run_weight_averaging.py*. If you made no changes to the architectures and stuck to the default directory layout, it
will automatically load any models you produced with one pipeline, average their parameters to get a slightly more
robust model and save the result as
*best.pt* in the same directory where all the corresponding checkpoints lie. This also compresses the file size
significantly, so you should do this and then use the
*best.pt* model for inference.
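
The idea behind parameter averaging can be illustrated in plain Python. The sketch below assumes a uniform mean over
all checkpoints and represents each parameter as a flat list of floats. Real checkpoints hold tensors, and the
averaging script may differ in its details, so treat this as an illustration of the principle only. The function
name is hypothetical.

```python
def average_checkpoints(state_dicts):
    """Average several parameter dictionaries element-wise.

    Each state dict maps a parameter name to a flat list of floats here.
    Real checkpoints hold tensors, but the arithmetic is the same.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint to average")
    averaged = dict()
    for name in state_dicts[0]:
        # zip(*...) groups the values at the same position across all checkpoints
        columns = zip(*(state_dict[name] for state_dict in state_dicts))
        averaged[name] = [sum(values) / len(state_dicts) for values in columns]
    return averaged


checkpoints = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]},
]
print(average_checkpoints(checkpoints))
# {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```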

---

## Creating a new InferenceInterface

To build a new
*InferenceInterface*, which you can then use for super simple inference, we're going to use an existing one as template
again. Make a copy of the
*InferenceInterface*. Change the name of the class in the copy and change the paths to the models to use the trained
models of your choice. Instantiate the model with the same hyperparameters that you used when you created it in the
corresponding training pipeline. The last thing to check is the language that you supply to the text frontend. Make sure
it matches what you used during training.

With your newly created
*InferenceInterface*, you can use your trained models pretty much anywhere, e.g. in other projects. All you need is the
*Utility* directory, the
*Layers*
directory, the
*Preprocessing* directory and the
*InferenceInterfaces* directory (and of course your model checkpoint). That's all the code you need, it works
standalone.

---

## Using a trained Model for Inference

An
*InferenceInterface* contains two useful methods. They are
*read_to_file* and
*read_aloud*.

- *read_to_file* takes as input a list of strings and a filename. It will synthesize the sentences in the list,
  concatenate them with a short pause in between and write them to the filepath you supply as the other argument.

- *read_aloud* takes just a string, which it will then convert to speech and immediately play using the system's
  speakers. If you set the optional argument
  *view* to
  *True* when calling it, it will also show a plot of the phonemes it produced, the spectrogram it came up with, and the
  wave it created from that spectrogram. So all the representations can be seen, text to phoneme, phoneme to spectrogram
  and finally spectrogram to wave.
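
To illustrate what *read_to_file* does conceptually, here is a toy stand-in that replaces the actual synthesis with a
simple beep. Everything in it is hypothetical: the class `ToyInterface`, the method `synthesize`, the argument names
and the pause length are made up for this sketch and are not the toolkit's API. Only the overall flow (synthesize each
sentence, insert a short pause, write one file) follows the description above.

```python
import math
import struct
import wave


class ToyInterface:
    """Hypothetical stand-in for an InferenceInterface: it 'speaks' beeps."""

    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate

    def synthesize(self, sentence):
        # Placeholder synthesis: a 440 Hz tone, 50 ms per character of text.
        n_samples = (self.sample_rate // 20) * max(len(sentence), 1)
        return [0.3 * math.sin(2 * math.pi * 440 * t / self.sample_rate)
                for t in range(n_samples)]

    def read_to_file(self, text_list, file_location, pause_seconds=0.5):
        silence = [0.0] * int(pause_seconds * self.sample_rate)
        wav = []
        for index, sentence in enumerate(text_list):
            if index > 0:
                wav += silence  # short pause between sentences
            wav += self.synthesize(sentence)
        with wave.open(file_location, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)  # 16-bit PCM
            out.setframerate(self.sample_rate)
            samples = (int(sample * 32767) for sample in wav)
            out.writeframes(struct.pack(f"<{len(wav)}h", *samples))
        return len(wav)  # number of samples written
```

A call then looks like `ToyInterface().read_to_file(["Hello.", "How are you?"], "test.wav")`.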

Those methods are used in demo code in the toolkit. In
*run_interactive_demo.py* and
*run_text_to_file_reader.py*, you can import
*InferenceInterfaces* that you created and add them to the dictionary in each of the files with a shorthand that makes
sense. In the interactive demo, you can just call the Python script, then type in the shorthand when prompted and
immediately listen to your synthesis saying whatever you put in next (be wary of out-of-memory errors for overly long
inputs). In the text reader demo script you have to call the function that wraps around the
*InferenceInterface* and supply the shorthand of your choice. It should be pretty clear from looking at it.

---

## FAQ

Here are a few points that were brought up by users:

- My error message shows GPU0, even though I specified a different GPU - The way GPU selection works is that the
  specified GPU is set as the only visible device, in order to avoid backend stuff running accidentally on different
  GPUs. So internally the program will name the device GPU0, because it is the only GPU it can see. It is actually
  running on the GPU you specified.
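
The mechanism can be sketched with the standard CUDA environment variables. The function name `select_device` is
hypothetical and the toolkit's own code may look different. Note that these variables only take effect if they are
set before CUDA is initialized in the process.

```python
import os


def select_device(gpu_id):
    """Make only the chosen GPU visible and return the device name to use.

    Sketch of the mechanism described above; the function name is hypothetical.
    """
    if gpu_id == "cpu":
        os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide all GPUs
        return "cpu"
    # Order devices by PCI bus ID so that the IDs match those shown by nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Only the chosen GPU stays visible, so inside the process it has index 0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return "cuda:0"


print(select_device("3"), os.environ["CUDA_VISIBLE_DEVICES"])
# cuda:0 3
```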

---

This toolkit has been written by Florian Lux (except for the PyTorch modules taken
from [ESPnet](https://github.com/espnet/espnet) and
[ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN), as mentioned above), so if you come across problems
or questions, feel free to [write a mail](mailto:[email protected]). Also let me know if you do something
cool with it. Thank you for reading.

## Citation

```
@inproceedings{lux2021toucan,
  title={{The IMS Toucan system for the Blizzard Challenge 2021}},
  author={Florian Lux and Julia Koch and Antje Schweitzer and Ngoc Thang Vu},
  year={2021},
  booktitle={Proc. Blizzard Challenge Workshop},
  volume={2021},
  publisher={{Speech Synthesis SIG}}
}
```