Upload 20 files
Browse files
- LICENSE.md +24 -0
- README.md +66 -0
- demo_cli.py +208 -0
- demo_toolbox.py +37 -0
- encoder.zip +3 -0
- encoder_preprocess.py +71 -0
- encoder_train.py +44 -0
- models.zip +3 -0
- requirements.txt +0 -0
- samples.zip +3 -0
- synthesizer.zip +3 -0
- synthesizer_preprocess_audio.py +47 -0
- synthesizer_preprocess_embeds.py +25 -0
- synthesizer_train.py +36 -0
- toolbox.zip +3 -0
- utils.zip +3 -0
- vocoder.zip +3 -0
- vocoder_preprocess.py +48 -0
- vocoder_train (1).py +53 -0
- vocoder_train.py +53 -0
LICENSE.md
ADDED
@@ -0,0 +1,24 @@
MIT License

Modified & original work Copyright (c) 2019 Corentin Jemine (https://github.com/CorentinJ)
Original work Copyright (c) 2018 Rayhane Mama (https://github.com/Rayhane-mamah)
Original work Copyright (c) 2019 fatchord (https://github.com/fatchord)
Original work Copyright (c) 2015 braindead (https://github.com/braindead)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
ADDED
@@ -0,0 +1,66 @@
# Real-Time Voice Cloning
This repository is an implementation of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) (SV2TTS) with a vocoder that works in real time. This was my [master's thesis](https://matheo.uliege.be/handle/2268.2/6801).

SV2TTS is a deep learning framework in three stages. In the first stage, one creates a digital representation of a voice from a few seconds of audio. In the second and third stages, this representation is used as a reference to generate speech given arbitrary text.
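For readers who want the three stages in code rather than prose, here is a minimal sketch using this repository's own inference modules. It condenses what `demo_cli.py` (included in this upload) does interactively, and it assumes the pretrained models sit under `saved_models/default/`; `reference.wav` and the output filename are placeholders for your own files.

```python
from pathlib import Path

import numpy as np
import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Stage 1: embed a few seconds of reference audio into a speaker representation
encoder.load_model(Path("saved_models/default/encoder.pt"))
wav = encoder.preprocess_wav(Path("reference.wav"))  # placeholder path to your own recording
embed = encoder.embed_utterance(wav)

# Stage 2: synthesize a mel spectrogram for arbitrary text, conditioned on that embedding
synthesizer = Synthesizer(Path("saved_models/default/synthesizer.pt"))
spec = synthesizer.synthesize_spectrograms(["Hello, this is a cloned voice."], [embed])[0]

# Stage 3: vocode the spectrogram into a waveform and save it to disk
vocoder.load_model(Path("saved_models/default/vocoder.pt"))
generated_wav = vocoder.infer_waveform(spec)
sf.write("cloned.wav", generated_wav.astype(np.float32), synthesizer.sample_rate)
```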
**Video demonstration** (click the picture):

[![Toolbox demo](https://i.imgur.com/8lFUlgz.png)](https://www.youtube.com/watch?v=-O_hYhToKoA)


### Papers implemented
| URL | Designation | Title | Implementation source |
| --- | ----------- | ----- | --------------------- |
| [**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
| [1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
| [1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
| [1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | This repo |

## News
**08/09/22**: Our team at Resemble.AI is releasing a voice conversion model (closed source); check out my demo [here](https://www.youtube.com/watch?v=f075EOzYKog).

**10/01/22**: I recommend checking out [CoquiTTS](https://github.com/coqui-ai/tts). It's a good and up-to-date TTS repository targeted at the ML community. It can also do voice cloning and more, such as cross-language cloning or voice conversion.

**28/12/21**: I've done a [major maintenance update](https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/961). Mostly, I've worked on making setup easier. Find new instructions in the section below.

**14/02/21**: This repo now runs on PyTorch instead of TensorFlow, thanks to the help of @bluefish.

**13/11/19**: I'm now working full time and will rarely maintain this repo anymore. To anyone who reads this:
- **If you just want to clone your voice (and not someone else's):** I recommend our free plan on [Resemble.AI](https://www.resemble.ai/). You will get better voice quality and fewer prosody errors.
- **If this is not your case:** proceed with this repository, but you might end up disappointed by the results. If you're planning to work on a serious project, my strong advice: find another TTS repo. Go [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364) for more info.

**20/08/19:** I'm working on [resemblyzer](https://github.com/resemble-ai/Resemblyzer), an independent package for the voice encoder (inference only). You can use your trained encoder models from this repo with it.

## Setup

### 1. Install Requirements
1. Both Windows and Linux are supported. A GPU is recommended for training and for inference speed, but it is not mandatory.
2. Python 3.7 is recommended. Python 3.5 or greater should work, but you'll probably have to tweak the dependencies' versions. I recommend setting up a virtual environment using `venv`, but this is optional.
3. Install [ffmpeg](https://ffmpeg.org/download.html#get-packages). This is necessary for reading audio files.
4. Install [PyTorch](https://pytorch.org/get-started/locally/). Pick the latest stable version, your operating system, your package manager (pip by default) and finally pick any of the proposed CUDA versions if you have a GPU, otherwise pick CPU. Run the given command. A quick sanity check for this step is sketched after this list.
5. Install the remaining requirements with `pip install -r requirements.txt`.

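The following is a minimal sketch (not part of the original instructions) to confirm that the PyTorch install from step 4 can see your hardware; it mirrors the environment report that `demo_cli.py` prints at startup and uses only standard `torch` calls.

```python
import torch

# Report whether CUDA is usable; CPU-only setups are still supported.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    print("Found a GPU: %s with %.1f GB of memory" % (props.name, props.total_memory / 1e9))
else:
    print("No GPU detected; training and inference will run on the CPU.")
```
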
### 2. (Optional) Download Pretrained Models
Pretrained models are now downloaded automatically. If this doesn't work for you, you can manually download them [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models).

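If you prefer to trigger the download yourself (for example, on a machine you are provisioning ahead of time), here is a minimal sketch using the same helper that `demo_cli.py` and `demo_toolbox.py` call; `saved_models` is the default target directory those scripts assume.

```python
from pathlib import Path

from utils.default_models import ensure_default_models

# Fetch the default encoder/synthesizer/vocoder checkpoints if they are missing,
# the same way demo_cli.py does before running its configuration test.
ensure_default_models(Path("saved_models"))
```
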
### 3. (Optional) Test Configuration
Before you download any dataset, you can begin by testing your configuration with:

`python demo_cli.py`

If all tests pass, you're good to go.

### 4. (Optional) Download Datasets
For playing with the toolbox alone, I only recommend downloading [`LibriSpeech/train-clean-100`](https://www.openslr.org/resources/12/train-clean-100.tar.gz). Extract the contents as `<datasets_root>/LibriSpeech/train-clean-100`, where `<datasets_root>` is a directory of your choosing. Other datasets are supported in the toolbox; see [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Training#datasets). You're free not to download any dataset, but then you will need your own data as audio files, or you will have to record it with the toolbox.

### 5. Launch the Toolbox
You can then try the toolbox:

`python demo_toolbox.py -d <datasets_root>`
or
`python demo_toolbox.py`

depending on whether you downloaded any datasets. If you are running an X-server or if you have the error `Aborted (core dumped)`, see [this issue](https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/11#issuecomment-504733590).
demo_cli.py
ADDED
@@ -0,0 +1,208 @@
import argparse
import os
from pathlib import Path

import librosa
import numpy as np
import soundfile as sf
import torch

from encoder import inference as encoder
from encoder.params_model import model_embedding_size as speaker_embedding_size
from synthesizer.inference import Synthesizer
from utils.argutils import print_args
from utils.default_models import ensure_default_models
from vocoder import inference as vocoder


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("-e", "--enc_model_fpath", type=Path,
                        default="saved_models/default/encoder.pt",
                        help="Path to a saved encoder")
    parser.add_argument("-s", "--syn_model_fpath", type=Path,
                        default="saved_models/default/synthesizer.pt",
                        help="Path to a saved synthesizer")
    parser.add_argument("-v", "--voc_model_fpath", type=Path,
                        default="saved_models/default/vocoder.pt",
                        help="Path to a saved vocoder")
    parser.add_argument("--cpu", action="store_true", help=\
        "If True, processing is done on CPU, even when a GPU is available.")
    parser.add_argument("--no_sound", action="store_true", help=\
        "If True, audio won't be played.")
    parser.add_argument("--seed", type=int, default=None, help=\
        "Optional random number seed value to make toolbox deterministic.")
    args = parser.parse_args()
    arg_dict = vars(args)
    print_args(args, parser)

    # Hide GPUs from Pytorch to force CPU processing
    if arg_dict.pop("cpu"):
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

    print("Running a test of your configuration...\n")

    if torch.cuda.is_available():
        device_id = torch.cuda.current_device()
        gpu_properties = torch.cuda.get_device_properties(device_id)
        ## Print some environment information (for debugging purposes)
        print("Found %d GPUs available. Using GPU %d (%s) of compute capability %d.%d with "
              "%.1fGb total memory.\n" %
              (torch.cuda.device_count(),
               device_id,
               gpu_properties.name,
               gpu_properties.major,
               gpu_properties.minor,
               gpu_properties.total_memory / 1e9))
    else:
        print("Using CPU for inference.\n")

    ## Load the models one by one.
    print("Preparing the encoder, the synthesizer and the vocoder...")
    ensure_default_models(Path("saved_models"))
    encoder.load_model(args.enc_model_fpath)
    synthesizer = Synthesizer(args.syn_model_fpath)
    vocoder.load_model(args.voc_model_fpath)


    ## Run a test
    print("Testing your configuration with small inputs.")
    # Forward an audio waveform of zeroes that lasts 1 second. Notice how we can get the encoder's
    # sampling rate, which may differ.
    # If you're unfamiliar with digital audio, know that it is encoded as an array of floats
    # (or sometimes integers, but mostly floats in this project) ranging from -1 to 1.
    # The sampling rate is the number of values (samples) recorded per second; it is set to
    # 16000 for the encoder. Creating an array of length <sampling_rate> will always correspond
    # to 1 second of audio.
    print("\tTesting the encoder...")
    encoder.embed_utterance(np.zeros(encoder.sampling_rate))

    # Create a dummy embedding. You would normally use the embedding that encoder.embed_utterance
    # returns, but here we're going to make one ourselves just for the sake of showing that it's
    # possible.
    embed = np.random.rand(speaker_embedding_size)
    # Embeddings are L2-normalized (this isn't important here, but if you want to make your own
    # embeddings it will be).
    embed /= np.linalg.norm(embed)
    # The synthesizer can handle multiple inputs with batching. Let's create another embedding to
    # illustrate that.
    embeds = [embed, np.zeros(speaker_embedding_size)]
    texts = ["test 1", "test 2"]
    print("\tTesting the synthesizer... (loading the model will output a lot of text)")
    mels = synthesizer.synthesize_spectrograms(texts, embeds)

    # The vocoder synthesizes one waveform at a time, but it's more efficient for long ones. We
    # can concatenate the mel spectrograms to a single one.
    mel = np.concatenate(mels, axis=1)
    # The vocoder can take a callback function to display the generation. More on that later. For
    # now we'll simply hide it like this:
    no_action = lambda *args: None
    print("\tTesting the vocoder...")
    # For the sake of making this test short, we'll pass a short target length. The target length
    # is the length of the wav segments that are processed in parallel. E.g. for audio sampled
    # at 16000 Hertz, a target length of 8000 means that the target audio will be cut in chunks of
    # 0.5 seconds which will all be generated together. The parameters here are absurdly short, and
    # that has a detrimental effect on the quality of the audio. The default parameters are
    # recommended in general.
    vocoder.infer_waveform(mel, target=200, overlap=50, progress_callback=no_action)

    print("All tests passed! You can now synthesize speech.\n\n")


    ## Interactive speech generation
    print("This is a GUI-less example of interface to SV2TTS. The purpose of this script is to "
          "show how you can interface this project easily with your own. See the source code for "
          "an explanation of what is happening.\n")

    print("Interactive generation loop")
    num_generated = 0
    while True:
        try:
            # Get the reference audio filepath
            message = "Reference voice: enter an audio filepath of a voice to be cloned (mp3, " \
                      "wav, m4a, flac, ...):\n"
            in_fpath = Path(input(message).replace("\"", "").replace("\'", ""))

            ## Computing the embedding
            # First, we load the wav using the function that the speaker encoder provides. This is
            # important: there is preprocessing that must be applied.

            # The following two methods are equivalent:
            # - Directly load from the filepath:
            preprocessed_wav = encoder.preprocess_wav(in_fpath)
            # - If the wav is already loaded:
            original_wav, sampling_rate = librosa.load(str(in_fpath))
            preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
            print("Loaded file successfully")

            # Then we derive the embedding. There are many functions and parameters that the
            # speaker encoder interfaces. These are mostly for in-depth research. You will typically
            # only use this function (with its default parameters):
            embed = encoder.embed_utterance(preprocessed_wav)
            print("Created the embedding")


            ## Generating the spectrogram
            text = input("Write a sentence (+-20 words) to be synthesized:\n")

            # If seed is specified, reset torch seed and force synthesizer reload
            if args.seed is not None:
                torch.manual_seed(args.seed)
                synthesizer = Synthesizer(args.syn_model_fpath)

            # The synthesizer works in batch, so you need to put your data in a list or numpy array
            texts = [text]
            embeds = [embed]
            # If you know what the attention layer alignments are, you can retrieve them here by
            # passing return_alignments=True
            specs = synthesizer.synthesize_spectrograms(texts, embeds)
            spec = specs[0]
            print("Created the mel spectrogram")


            ## Generating the waveform
            print("Synthesizing the waveform:")

            # If seed is specified, reset torch seed and reload vocoder
            if args.seed is not None:
                torch.manual_seed(args.seed)
                vocoder.load_model(args.voc_model_fpath)

            # Synthesizing the waveform is fairly straightforward. Remember that the longer the
            # spectrogram, the more time-efficient the vocoder.
            generated_wav = vocoder.infer_waveform(spec)


            ## Post-generation
            # There's a bug with sounddevice that makes the audio cut one second earlier, so we
            # pad it.
            generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")

            # Trim excess silences to compensate for gaps in spectrograms (issue #53)
            generated_wav = encoder.preprocess_wav(generated_wav)

            # Play the audio (non-blocking)
            if not args.no_sound:
                import sounddevice as sd
                try:
                    sd.stop()
                    sd.play(generated_wav, synthesizer.sample_rate)
                except sd.PortAudioError as e:
                    print("\nCaught exception: %s" % repr(e))
                    print("Continuing without audio playback. Suppress this message with the \"--no_sound\" flag.\n")
                except:
                    raise

            # Save it on the disk
            filename = "demo_output_%02d.wav" % num_generated
            print(generated_wav.dtype)
            sf.write(filename, generated_wav.astype(np.float32), synthesizer.sample_rate)
            num_generated += 1
            print("\nSaved output as %s\n\n" % filename)


        except Exception as e:
            print("Caught exception: %s" % repr(e))
            print("Restarting\n")
demo_toolbox.py
ADDED
@@ -0,0 +1,37 @@
import argparse
import os
from pathlib import Path

from toolbox import Toolbox
from utils.argutils import print_args
from utils.default_models import ensure_default_models


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Runs the toolbox.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument("-d", "--datasets_root", type=Path, help= \
        "Path to the directory containing your datasets. See toolbox/__init__.py for a list of "
        "supported datasets.", default=None)
    parser.add_argument("-m", "--models_dir", type=Path, default="saved_models",
                        help="Directory containing all saved models")
    parser.add_argument("--cpu", action="store_true", help=\
        "If True, all inference will be done on CPU")
    parser.add_argument("--seed", type=int, default=None, help=\
        "Optional random number seed value to make toolbox deterministic.")
    args = parser.parse_args()
    arg_dict = vars(args)
    print_args(args, parser)

    # Hide GPUs from Pytorch to force CPU processing
    if arg_dict.pop("cpu"):
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

    # Remind the user to download pretrained models if needed
    ensure_default_models(args.models_dir)

    # Launch the toolbox
    Toolbox(**arg_dict)
encoder.zip
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2833c484d2aeb64cde790e52f79db760d4399efca62b64fc2722bbc1c9b14cff
size 31052
encoder_preprocess.py
ADDED
@@ -0,0 +1,71 @@
from encoder.preprocess import preprocess_librispeech, preprocess_voxceleb1, preprocess_voxceleb2
from utils.argutils import print_args
from pathlib import Path
import argparse


if __name__ == "__main__":
    class MyFormatter(argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter):
        pass

    parser = argparse.ArgumentParser(
        description="Preprocesses audio files from datasets, encodes them as mel spectrograms and "
                    "writes them to the disk. This will allow you to train the encoder. The "
                    "datasets required are at least one of VoxCeleb1, VoxCeleb2 and LibriSpeech. "
                    "Ideally, you should have all three. You should extract them as they are "
                    "after having downloaded them and put them in a same directory, e.g.:\n"
                    "-[datasets_root]\n"
                    "  -LibriSpeech\n"
                    "    -train-other-500\n"
                    "  -VoxCeleb1\n"
                    "    -wav\n"
                    "    -vox1_meta.csv\n"
                    "  -VoxCeleb2\n"
                    "    -dev",
        formatter_class=MyFormatter
    )
    parser.add_argument("datasets_root", type=Path, help=\
        "Path to the directory containing your LibriSpeech/TTS and VoxCeleb datasets.")
    parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help=\
        "Path to the output directory that will contain the mel spectrograms. If left out, "
        "defaults to <datasets_root>/SV2TTS/encoder/")
    parser.add_argument("-d", "--datasets", type=str,
                        default="librispeech_other,voxceleb1,voxceleb2", help=\
        "Comma-separated list of the name of the datasets you want to preprocess. Only the train "
        "set of these datasets will be used. Possible names: librispeech_other, voxceleb1, "
        "voxceleb2.")
    parser.add_argument("-s", "--skip_existing", action="store_true", help=\
        "Whether to skip existing output files with the same name. Useful if this script was "
        "interrupted.")
    parser.add_argument("--no_trim", action="store_true", help=\
        "Preprocess audio without trimming silences (not recommended).")
    args = parser.parse_args()

    # Verify webrtcvad is available
    if not args.no_trim:
        try:
            import webrtcvad
        except:
            raise ModuleNotFoundError("Package 'webrtcvad' not found. This package enables "
                "noise removal and is recommended. Please install and try again. If installation "
                "fails, use --no_trim to disable this error message.")
    del args.no_trim

    # Process the arguments
    args.datasets = args.datasets.split(",")
    if not hasattr(args, "out_dir"):
        args.out_dir = args.datasets_root.joinpath("SV2TTS", "encoder")
    assert args.datasets_root.exists()
    args.out_dir.mkdir(exist_ok=True, parents=True)

    # Preprocess the datasets
    print_args(args, parser)
    preprocess_func = {
        "librispeech_other": preprocess_librispeech,
        "voxceleb1": preprocess_voxceleb1,
        "voxceleb2": preprocess_voxceleb2,
    }
    args = vars(args)
    for dataset in args.pop("datasets"):
        print("Preprocessing %s" % dataset)
        preprocess_func[dataset](**args)
encoder_train.py
ADDED
@@ -0,0 +1,44 @@
from utils.argutils import print_args
from encoder.train import train
from pathlib import Path
import argparse


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Trains the speaker encoder. You must have run encoder_preprocess.py first.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument("run_id", type=str, help= \
        "Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
        "from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
        "states and restart from scratch.")
    parser.add_argument("clean_data_root", type=Path, help= \
        "Path to the output directory of encoder_preprocess.py. If you left the default "
        "output directory when preprocessing, it should be <datasets_root>/SV2TTS/encoder/.")
    parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
        "Path to the root directory that contains all models. A directory <run_name> will be created under this root. "
        "It will contain the saved model weights, as well as backups of those weights and plots generated during "
        "training.")
    parser.add_argument("-v", "--vis_every", type=int, default=10, help= \
        "Number of steps between updates of the loss and the plots.")
    parser.add_argument("-u", "--umap_every", type=int, default=100, help= \
        "Number of steps between updates of the umap projection. Set to 0 to never update the "
        "projections.")
    parser.add_argument("-s", "--save_every", type=int, default=500, help= \
        "Number of steps between updates of the model on the disk. Set to 0 to never save the "
        "model.")
    parser.add_argument("-b", "--backup_every", type=int, default=7500, help= \
        "Number of steps between backups of the model. Set to 0 to never make backups of the "
        "model.")
    parser.add_argument("-f", "--force_restart", action="store_true", help= \
        "Do not load any saved model.")
    parser.add_argument("--visdom_server", type=str, default="http://localhost")
    parser.add_argument("--no_visdom", action="store_true", help= \
        "Disable visdom.")
    args = parser.parse_args()

    # Run the training
    print_args(args, parser)
    train(**vars(args))
models.zip
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:65687307efd53bc7e820eb5cd38e6655f0d1c98c8203f13e34b90adf9a959a24
size 12373
requirements.txt
ADDED
Binary file (562 Bytes).
samples.zip
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3ec993c333b396acefdefa19336428cf924e249f5ac58fffad614b267542fbf8
size 104866
synthesizer.zip
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8c597393b1576379fcfebf30d59dd8654841159f850684bbd2bb2712a1bd99ed
size 81237
synthesizer_preprocess_audio.py
ADDED
@@ -0,0 +1,47 @@
from synthesizer.preprocess import preprocess_dataset
from synthesizer.hparams import hparams
from utils.argutils import print_args
from pathlib import Path
import argparse


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Preprocesses audio files from datasets, encodes them as mel spectrograms "
                    "and writes them to the disk. Audio files are also saved, to be used by the "
                    "vocoder for training.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("datasets_root", type=Path, help=\
        "Path to the directory containing your LibriSpeech/TTS datasets.")
    parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help=\
        "Path to the output directory that will contain the mel spectrograms, the audios and the "
        "embeds. Defaults to <datasets_root>/SV2TTS/synthesizer/")
    parser.add_argument("-n", "--n_processes", type=int, default=4, help=\
        "Number of processes in parallel.")
    parser.add_argument("-s", "--skip_existing", action="store_true", help=\
        "Whether to skip existing files with the same name. Useful if the preprocessing was "
        "interrupted.")
    parser.add_argument("--hparams", type=str, default="", help=\
        "Hyperparameter overrides as a comma-separated list of name=value pairs")
    parser.add_argument("--no_alignments", action="store_true", help=\
        "Use this option when the dataset does not include alignments "
        "(these are used to split long audio files into sub-utterances).")
    parser.add_argument("--datasets_name", type=str, default="LibriSpeech", help=\
        "Name of the dataset directory to process.")
    parser.add_argument("--subfolders", type=str, default="train-clean-100,train-clean-360", help=\
        "Comma-separated list of subfolders to process inside your dataset directory")
    args = parser.parse_args()

    # Process the arguments
    if not hasattr(args, "out_dir"):
        args.out_dir = args.datasets_root.joinpath("SV2TTS", "synthesizer")

    # Create directories
    assert args.datasets_root.exists()
    args.out_dir.mkdir(exist_ok=True, parents=True)

    # Preprocess the dataset
    print_args(args, parser)
    args.hparams = hparams.parse(args.hparams)
    preprocess_dataset(**vars(args))
synthesizer_preprocess_embeds.py
ADDED
@@ -0,0 +1,25 @@
from synthesizer.preprocess import create_embeddings
from utils.argutils import print_args
from pathlib import Path
import argparse


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Creates embeddings for the synthesizer from the LibriSpeech utterances.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("synthesizer_root", type=Path, help=\
        "Path to the synthesizer training data that contains the audios and the train.txt file. "
        "If you left everything as default, it should be <datasets_root>/SV2TTS/synthesizer/.")
    parser.add_argument("-e", "--encoder_model_fpath", type=Path,
                        default="saved_models/default/encoder.pt", help=\
        "Path to your trained encoder model.")
    parser.add_argument("-n", "--n_processes", type=int, default=4, help= \
        "Number of parallel processes. An encoder is created for each, so you may need to lower "
        "this value on GPUs with low memory. Set it to 1 if CUDA is unhappy.")
    args = parser.parse_args()

    # Preprocess the dataset
    print_args(args, parser)
    create_embeddings(**vars(args))
synthesizer_train.py
ADDED
@@ -0,0 +1,36 @@
from pathlib import Path

from synthesizer.hparams import hparams
from synthesizer.train import train
from utils.argutils import print_args
import argparse


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("run_id", type=str, help= \
        "Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
        "from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
        "states and restart from scratch.")
    parser.add_argument("syn_dir", type=Path, help= \
        "Path to the synthesizer directory that contains the ground truth mel spectrograms, "
        "the wavs and the embeds.")
    parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
        "Path to the output directory that will contain the saved model weights and the logs.")
    parser.add_argument("-s", "--save_every", type=int, default=1000, help= \
        "Number of steps between updates of the model on the disk. Set to 0 to never save the "
        "model.")
    parser.add_argument("-b", "--backup_every", type=int, default=25000, help= \
        "Number of steps between backups of the model. Set to 0 to never make backups of the "
        "model.")
    parser.add_argument("-f", "--force_restart", action="store_true", help= \
        "Do not load any saved model and restart from scratch.")
    parser.add_argument("--hparams", default="", help=\
        "Hyperparameter overrides as a comma-separated list of name=value pairs")
    args = parser.parse_args()
    print_args(args, parser)

    args.hparams = hparams.parse(args.hparams)

    # Run the training
    train(**vars(args))
toolbox.zip
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f8d3271d555a8e7c712fd0761ffbdeba3d7f71dc8fd28ab7c634d32189f556a8
size 10516
utils.zip
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:27b3609444ce073545cfa79cf9941f7b80cf9e6851cb5a9b7d68dd1ea044ddb0
size 6243
vocoder.zip
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b1ab43d493be55641a1737d581c3ca86a4232af787687ba5e627180918cf73c0
size 47098
vocoder_preprocess.py
ADDED
@@ -0,0 +1,48 @@
import argparse
import os
from pathlib import Path

from synthesizer.hparams import hparams
from synthesizer.synthesize import run_synthesis
from utils.argutils import print_args


if __name__ == "__main__":
    class MyFormatter(argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter):
        pass

    parser = argparse.ArgumentParser(
        description="Creates ground-truth aligned (GTA) spectrograms from the vocoder.",
        formatter_class=MyFormatter
    )
    parser.add_argument("datasets_root", type=Path, help=\
        "Path to the directory containing your SV2TTS directory. If you specify both --in_dir and "
        "--out_dir, this argument won't be used.")
    parser.add_argument("-s", "--syn_model_fpath", type=Path,
                        default="saved_models/default/synthesizer.pt",
                        help="Path to a saved synthesizer")
    parser.add_argument("-i", "--in_dir", type=Path, default=argparse.SUPPRESS, help= \
        "Path to the synthesizer directory that contains the mel spectrograms, the wavs and the "
        "embeds. Defaults to <datasets_root>/SV2TTS/synthesizer/.")
    parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help= \
        "Path to the output vocoder directory that will contain the ground truth aligned mel "
        "spectrograms. Defaults to <datasets_root>/SV2TTS/vocoder/.")
    parser.add_argument("--hparams", default="", help=\
        "Hyperparameter overrides as a comma-separated list of name=value pairs")
    parser.add_argument("--cpu", action="store_true", help=\
        "If True, processing is done on CPU, even when a GPU is available.")
    args = parser.parse_args()
    print_args(args, parser)
    modified_hp = hparams.parse(args.hparams)

    if not hasattr(args, "in_dir"):
        args.in_dir = args.datasets_root / "SV2TTS" / "synthesizer"
    if not hasattr(args, "out_dir"):
        args.out_dir = args.datasets_root / "SV2TTS" / "vocoder"

    if args.cpu:
        # Hide GPUs from Pytorch to force CPU processing
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

    run_synthesis(args.in_dir, args.out_dir, args.syn_model_fpath, modified_hp)
vocoder_train (1).py
ADDED
@@ -0,0 +1,53 @@
import argparse
from pathlib import Path

from utils.argutils import print_args
from vocoder.train import train


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Trains the vocoder from the synthesizer audios and the GTA synthesized mels, "
                    "or ground truth mels.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument("run_id", type=str, help= \
        "Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
        "from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
        "states and restart from scratch.")
    parser.add_argument("datasets_root", type=Path, help= \
        "Path to the directory containing your SV2TTS directory. Specifying --syn_dir or --voc_dir "
        "will take priority over this argument.")
    parser.add_argument("--syn_dir", type=Path, default=argparse.SUPPRESS, help= \
        "Path to the synthesizer directory that contains the ground truth mel spectrograms, "
        "the wavs and the embeds. Defaults to <datasets_root>/SV2TTS/synthesizer/.")
    parser.add_argument("--voc_dir", type=Path, default=argparse.SUPPRESS, help= \
        "Path to the vocoder directory that contains the GTA synthesized mel spectrograms. "
        "Defaults to <datasets_root>/SV2TTS/vocoder/. Unused if --ground_truth is passed.")
    parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
        "Path to the directory that will contain the saved model weights, as well as backups "
        "of those weights and wavs generated during training.")
    parser.add_argument("-g", "--ground_truth", action="store_true", help= \
        "Train on ground truth spectrograms (<datasets_root>/SV2TTS/synthesizer/mels).")
    parser.add_argument("-s", "--save_every", type=int, default=1000, help= \
        "Number of steps between updates of the model on the disk. Set to 0 to never save the "
        "model.")
    parser.add_argument("-b", "--backup_every", type=int, default=25000, help= \
        "Number of steps between backups of the model. Set to 0 to never make backups of the "
        "model.")
    parser.add_argument("-f", "--force_restart", action="store_true", help= \
        "Do not load any saved model and restart from scratch.")
    args = parser.parse_args()

    # Process the arguments
    if not hasattr(args, "syn_dir"):
        args.syn_dir = args.datasets_root / "SV2TTS" / "synthesizer"
    if not hasattr(args, "voc_dir"):
        args.voc_dir = args.datasets_root / "SV2TTS" / "vocoder"
    del args.datasets_root
    args.models_dir.mkdir(exist_ok=True)

    # Run the training
    print_args(args, parser)
    train(**vars(args))
vocoder_train.py
ADDED
@@ -0,0 +1,53 @@
import argparse
from pathlib import Path

from utils.argutils import print_args
from vocoder.train import train


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Trains the vocoder from the synthesizer audios and the GTA synthesized mels, "
                    "or ground truth mels.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument("run_id", type=str, help= \
        "Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
        "from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
        "states and restart from scratch.")
    parser.add_argument("datasets_root", type=Path, help= \
        "Path to the directory containing your SV2TTS directory. Specifying --syn_dir or --voc_dir "
        "will take priority over this argument.")
    parser.add_argument("--syn_dir", type=Path, default=argparse.SUPPRESS, help= \
        "Path to the synthesizer directory that contains the ground truth mel spectrograms, "
        "the wavs and the embeds. Defaults to <datasets_root>/SV2TTS/synthesizer/.")
    parser.add_argument("--voc_dir", type=Path, default=argparse.SUPPRESS, help= \
        "Path to the vocoder directory that contains the GTA synthesized mel spectrograms. "
        "Defaults to <datasets_root>/SV2TTS/vocoder/. Unused if --ground_truth is passed.")
    parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
        "Path to the directory that will contain the saved model weights, as well as backups "
        "of those weights and wavs generated during training.")
    parser.add_argument("-g", "--ground_truth", action="store_true", help= \
        "Train on ground truth spectrograms (<datasets_root>/SV2TTS/synthesizer/mels).")
    parser.add_argument("-s", "--save_every", type=int, default=1000, help= \
        "Number of steps between updates of the model on the disk. Set to 0 to never save the "
        "model.")
    parser.add_argument("-b", "--backup_every", type=int, default=25000, help= \
        "Number of steps between backups of the model. Set to 0 to never make backups of the "
        "model.")
    parser.add_argument("-f", "--force_restart", action="store_true", help= \
        "Do not load any saved model and restart from scratch.")
    args = parser.parse_args()

    # Process the arguments
    if not hasattr(args, "syn_dir"):
        args.syn_dir = args.datasets_root / "SV2TTS" / "synthesizer"
    if not hasattr(args, "voc_dir"):
        args.voc_dir = args.datasets_root / "SV2TTS" / "vocoder"
    del args.datasets_root
    args.models_dir.mkdir(exist_ok=True)

    # Run the training
    print_args(args, parser)
    train(**vars(args))