# VALL-E
## Introduction
This is an unofficial PyTorch implementation of VALL-E, a zero-shot voice-cloning model based on neural codec language modeling ([paper link](https://arxiv.org/abs/2301.02111)).
If trained properly, this model can match the performance reported in the original paper.

## Change notes
This is a refined version of the first VALL-E implementation in Amphion. We have switched the underlying implementation to Llama
to provide better model performance, faster training speed, and more readable code.
It can be a great resource if you want to learn about speech language models and their implementation.

## Installation requirements

Set up your environment as described in the Amphion README (you'll need a conda environment, and we recommend using Linux). A GPU is recommended if you want to train this model yourself.
For running inference with our pretrained models, you can generate samples even without a GPU.
To ensure your transformers library can run the code, we recommend additionally running:
```bash
pip install -U transformers==4.41.2
```
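
To confirm that the pinned version is the one your environment actually picks up, a quick check is:
```python
import transformers

# The code is tested against this pinned version; other versions may work
# but are not guaranteed to.
print(transformers.__version__)  # expected: 4.41.2
```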

## Inference with pretrained VALL-E models
### Download pretrained weights
You need to download our pretrained weights from Hugging Face.

Script to download the AR and NAR model checkpoints:
```bash
huggingface-cli download amphion/valle valle_ar_mls_196000.bin valle_nar_mls_164000.bin --local-dir ckpts
```
Script to download codec model (SpeechTokenizer) checkpoint:
```bash
mkdir -p ckpts/speechtokenizer_hubert_avg && huggingface-cli download amphion/valle SpeechTokenizer.pt config.json --local-dir ckpts/speechtokenizer_hubert_avg
```
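
If you prefer to download from Python instead of the CLI, `huggingface_hub` (the library behind `huggingface-cli`) can fetch the same files; this sketch mirrors the two commands above:
```python
from huggingface_hub import hf_hub_download

# AR and NAR model checkpoints
for filename in ["valle_ar_mls_196000.bin", "valle_nar_mls_164000.bin"]:
    hf_hub_download(repo_id="amphion/valle", filename=filename, local_dir="ckpts")

# Codec model (SpeechTokenizer) checkpoint and config
for filename in ["SpeechTokenizer.pt", "config.json"]:
    hf_hub_download(
        repo_id="amphion/valle",
        filename=filename,
        local_dir="ckpts/speechtokenizer_hubert_avg",
    )
```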

If you cannot access Hugging Face, consider using the hf-mirror endpoint to download:
```bash
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download amphion/valle valle_ar_mls_196000.bin valle_nar_mls_164000.bin --local-dir ckpts
```
```bash
mkdir -p ckpts/speechtokenizer_hubert_avg && HF_ENDPOINT=https://hf-mirror.com huggingface-cli download amphion/valle SpeechTokenizer.pt config.json --local-dir ckpts/speechtokenizer_hubert_avg
```
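
After downloading, a quick sanity check (a sketch assuming the `ckpts` layout created above) is to load the checkpoints and make sure they deserialize:
```python
import torch

# Load each state dict on CPU just to verify the files are intact.
for path in ["ckpts/valle_ar_mls_196000.bin", "ckpts/valle_nar_mls_164000.bin"]:
    state_dict = torch.load(path, map_location="cpu")
    print(f"{path}: {len(state_dict)} entries")
```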


### Inference in IPython notebook

We provide a pretrained VALL-E model trained on 45k hours of the MLS dataset, which consists of 10-20s English speech clips.
The `demo.ipynb` notebook provides a working example of running inference with our pretrained VALL-E model. Give it a try!

## Examining the model files
Examining the model files of VALL-E is a great way to learn how it works.
We provide examples that allow you to overfit a single batch (so no dataset download is required).

The AR model is essentially a causal language model that "continues" a speech prompt. The NAR model is a modification of the AR model that allows bidirectional attention.
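
To make this distinction concrete, here is a minimal standalone sketch (not code from this repo) contrasting the causal attention used by an AR model with the bidirectional attention used by a NAR model, via PyTorch's built-in scaled dot-product attention:
```python
import torch
import torch.nn.functional as F

T, d = 6, 16                          # toy sequence length and head dimension
q = k = v = torch.randn(1, 1, T, d)   # (batch, heads, time, dim)

# AR-style: causal mask, each position attends only to itself and the past.
ar_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# NAR-style: no mask, every position attends to the whole sequence.
nar_out = F.scaled_dot_product_attention(q, k, v)

print(ar_out.shape, nar_out.shape)    # both: torch.Size([1, 1, 6, 16])
```
The only difference is the attention mask; the rest of the computation is shared, which is why the NAR model can be described as a modification of the AR model.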


The files `valle_ar.py` and `valle_nar.py` in the `models/tts/valle_v2` folder are the model files; they can be run directly via `python -m models.tts.valle_v2.valle_ar` (or `python -m models.tts.valle_v2.valle_nar`).
This will invoke a test that overfits the model to a single example.

## Training VALL-E from scratch
### Preparing LibriTTS or LibriTTS-R dataset files

We have tested our training script on LibriTTS and LibriTTS-R.
You could download LibriTTS-R at [this link](https://www.openslr.org/141/) and LibriTTS at [this link](https://www.openslr.org/60).
The "train-clean-360" split is currently used by our configuration.
You can test the dataset code by running `python -m models.tts.valle_v2.libritts_dataset`.

For your reference, our unzipped dataset files have a file structure like this:
```
/path/to/LibriTTS_R
β”œβ”€β”€ BOOKS.txt
β”œβ”€β”€ CHAPTERS.txt
β”œβ”€β”€ dev-clean
β”‚   β”œβ”€β”€ 2412
β”‚   β”‚   β”œβ”€β”€ 153947
β”‚   β”‚   β”‚   β”œβ”€β”€ 2412_153947_000014_000000.normalized.txt
β”‚   β”‚   β”‚   β”œβ”€β”€ 2412_153947_000014_000000.original.txt
β”‚   β”‚   β”‚   β”œβ”€β”€ 2412_153947_000014_000000.wav
β”‚   β”‚   β”‚   β”œβ”€β”€ 2412_153947_000017_000001.normalized.txt
β”‚   β”‚   β”‚   β”œβ”€β”€ 2412_153947_000017_000001.original.txt
β”‚   β”‚   β”‚   β”œβ”€β”€ 2412_153947_000017_000001.wav
β”‚   β”‚   β”‚   β”œβ”€β”€ 2412_153947_000017_000005.normalized.txt
β”œβ”€β”€ train-clean-360
β”‚   β”œβ”€β”€ 422
β”‚   β”‚   └── 122949
β”‚   β”‚       β”œβ”€β”€ 422_122949_000009_000007.normalized.txt
β”‚   β”‚       β”œβ”€β”€ 422_122949_000009_000007.original.txt
β”‚   β”‚       β”œβ”€β”€ 422_122949_000009_000007.wav
β”‚   β”‚       β”œβ”€β”€ 422_122949_000013_000010.normalized.txt
β”‚   β”‚       β”œβ”€β”€ 422_122949_000013_000010.original.txt
β”‚   β”‚       β”œβ”€β”€ 422_122949_000013_000010.wav
β”‚   β”‚       β”œβ”€β”€ 422_122949.book.tsv
β”‚   β”‚       └── 422_122949.trans.tsv
```


Alternatively, you could write your own dataloader for your dataset.
You can reference the `__getitem__` method in `models/tts/valle_v2/mls_dataset.py`.
It should return a dict with a 1-dimensional tensor `'speech'`, which is the speech waveform at 16kHz, and a 1-dimensional tensor `'phone'`, which is the phoneme sequence of the speech.
As long as your dataset returns this in `__getitem__`, it should work.
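
As a starting point, here is a minimal sketch of such a dataset; `wav_paths` and `phone_ids` are hypothetical inputs (a list of audio file paths and a matching list of phoneme-ID sequences) that you would build from your own data:
```python
import torch
import torchaudio
from torch.utils.data import Dataset


class MyValleDataset(Dataset):
    """Minimal sketch of a custom VALL-E dataset."""

    def __init__(self, wav_paths, phone_ids):
        self.wav_paths = wav_paths  # list of audio file paths
        self.phone_ids = phone_ids  # matching list of phoneme-ID sequences

    def __len__(self):
        return len(self.wav_paths)

    def __getitem__(self, idx):
        speech, sr = torchaudio.load(self.wav_paths[idx])  # (channels, samples)
        if sr != 16000:  # the model expects 16kHz speech
            speech = torchaudio.functional.resample(speech, sr, 16000)
        return {
            "speech": speech.mean(dim=0),  # 1-D 16kHz waveform
            "phone": torch.tensor(self.phone_ids[idx], dtype=torch.long),  # 1-D phoneme sequence
        }
```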

### Changing batch size and dataset path in configuration file
Our configuration file for training the VALL-E AR model is at `egs/tts/VALLE_V2/exp_ar_libritts.json`, and for the NAR model at `egs/tts/VALLE_V2/exp_nar_libritts.json`.

To train your model, you need to modify the `dataset` entry in the JSON configurations.
It is currently at line 40; you should change "data_dir" to your dataset's root directory.
```
    "dataset": {
      "dataset_list":["train-clean-360"], // You can also change to other splits like "dev-clean"
      "data_dir": "/path/to/your/LibriTTS_R",
    },
```

You should also set a reasonable batch size in the "batch_size" entry (it is currently set to 5).


You can change other experiment settings in `egs/tts/VALLE_V2/exp_ar_libritts.json`, such as the learning rate, optimizer, and dataset.

### Run the command to train the AR model
(Make sure your current directory is the Amphion root directory.)
Run:
```sh
sh egs/tts/VALLE_V2/train_ar_libritts.sh
```
Your initial model checkpoint can be found at a path such as `ckpt/VALLE_V2/ar_libritts/checkpoint/epoch-0000_step-0000000_loss-7.397293/pytorch_model.bin`.


### Resume from existing checkpoint
Our framework supports resuming training from an existing checkpoint.

Run:
```sh
sh egs/tts/VALLE_V2/train_ar_libritts.sh --resume
```

### Finetuning based on our AR model
We provide our AR model's optimizer and random_states checkpoints to support finetuning (there is no need to download these files if you are only running inference with the pretrained model). First rename the files to `pytorch_model.bin`, `optimizer.bin`, and `random_states_0.pkl`; then you can resume from these checkpoints. [Link to AR optimizer checkpoint](https://huggingface.co./amphion/valle/blob/main/optimizer_valle_ar_mls_196000.bin) and [link to random_states_0.pkl](https://huggingface.co./amphion/valle/blob/main/random_states_0.pkl).
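
For example, assuming the three files were downloaded into a checkpoint directory of your choice (the path below is hypothetical), the renames could look like:
```python
from pathlib import Path

ckpt_dir = Path("ckpt/VALLE_V2/ar_libritts/checkpoint/finetune")  # hypothetical directory

# Rename the downloaded files to the names the resume logic expects.
(ckpt_dir / "valle_ar_mls_196000.bin").rename(ckpt_dir / "pytorch_model.bin")
(ckpt_dir / "optimizer_valle_ar_mls_196000.bin").rename(ckpt_dir / "optimizer.bin")
# random_states_0.pkl is already named as expected.
```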


### Run the command to train the NAR model
(Make sure your current directory is the Amphion root directory.)
Run:
```sh
sh egs/tts/VALLE_V2/train_nar_libritts.sh
```

### Inference with your models
Since our inference script is already provided, you can change the paths from our pretrained model to your newly trained models and run inference.

## Future plans
- [ ] Support more languages
- [ ] More are coming...