English
speech quantization
zhihaodu committed on
Commit 313f69d · 1 Parent(s): be8c86e

init commit

Files changed (6)
  1. LICENSE +21 -0
  2. README.md +139 -0
  3. config.yaml +188 -0
  4. example/example.wav +0 -0
  5. fig/framework.png +0 -0
  6. model.pth +3 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2023 Alibaba Inc.
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,142 @@
  ---
+ language: en
+ tags:
+ - speech quantization
  license: mit
+ datasets:
+ - LibriTTS
  ---
+
+ # Highlights
+ This model is a speech codec (quantizer) for English utterances.
+ - Achieves higher codec quality at low bandwidths
+ - Trained with structured dropout, so a single model supports multiple bandwidths at inference time
+ - Quantizes a raw speech waveform into a sequence of discrete tokens
+
+ # FunCodec model
+ This model is trained with [FunCodec](https://github.com/alibaba-damo-academy/FunCodec),
+ an open-source toolkit for speech quantization (codec) from the DAMO Academy, Alibaba Group.
+ This repository provides a model pre-trained on the LibriTTS corpus.
+ It can be applied to low-bandwidth speech communication, speech quantization, zero-shot speech synthesis
+ and other academic research topics.
+ Compared with [EnCodec](https://arxiv.org/abs/2210.13438) and [SoundStream](https://arxiv.org/abs/2107.03312),
+ the following techniques are used to train the model, resulting in higher codec quality and
+ [ViSQOL](https://github.com/google/visqol) scores at the same bandwidth:
+ - A magnitude spectrum loss is employed to enhance the middle- and high-frequency components
+ - Structured dropout is employed to smooth the code space and to enable multiple bandwidths in a single model
+ - Codes are initialized from k-means clusters rather than random values
+ - Codebooks are maintained with an exponential moving average and a dead-code elimination mechanism, resulting in a high codebook utilization factor
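As a rough illustration of the structured dropout listed above, the sketch below samples how many quantizer stages stay active for one training step. The candidate counts are copied from `rand_num_quant` in this commit's `config.yaml`; the uniform sampling and the helper names `sample_num_quantizers` and `keep_leading_stages` are assumptions made for illustration, not FunCodec's actual implementation.

```python
import random

# Candidate numbers of active quantizers, copied from rand_num_quant in config.yaml.
RAND_NUM_QUANT = (1, 2, 4, 8, 16, 32)


def sample_num_quantizers(rng, candidates=RAND_NUM_QUANT):
    """Pick how many leading quantizer stages stay active for one training step.

    Uniform sampling over the candidates is an assumption of this sketch.
    """
    return rng.choice(candidates)


def keep_leading_stages(stage_outputs, num_active):
    """Drop the trailing stages so only the first num_active ones contribute."""
    return stage_outputs[:num_active]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
num_active = sample_num_quantizers(rng)
stage_outputs = [f"stage_{i}" for i in range(32)]  # placeholders for per-stage embeddings
active = keep_leading_stages(stage_outputs, num_active)
print(num_active, len(active))
```

Because the decoder is trained on outputs from varying numbers of stages, only the first few stages need to be kept at inference time when a lower bandwidth is wanted.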
+
+ ## Model description
+ This model is a variational autoencoder that uses residual vector quantization (RVQ) to obtain
+ several parallel sequences of discrete latent representations. Here is an overview of FunCodec models.
+ <p align="center">
+ <img src="fig/framework.png" alt="FunCodec architecture"/>
+ </p>
+
+ In general, FunCodec models consist of five modules: a domain transformation module,
+ an encoder, an RVQ module, a decoder and a domain inversion module.
+ - Domain Transformation: transform signals into the time domain, short-time frequency domain, magnitude-angle domain or magnitude-phase domain.
+ - Encoder: encode signals into compact representations with stacked convolutional and LSTM layers.
+ - Semantic tokens (optional): augment encoder outputs with semantic tokens to enhance the content information; not used in this model.
+ - RVQ: quantize the representations into parallel sequences of discrete tokens with cascaded vector quantizers.
+ - Decoder: decode quantized embeddings back into the same signal domain as the inputs.
+ - Domain Inversion: re-synthesize perceptible waveforms from the different domains.
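To make the RVQ step above more concrete, here is a minimal pure-Python sketch of cascaded vector quantizers: each stage quantizes the residual left by the previous one, and decoding sums the selected code vectors. The tiny hand-written codebooks and the function names are hypothetical and chosen only for illustration; the real model uses 32 learned codebooks of 1024 entries each (see `config.yaml`).

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize vector stage by stage; each stage encodes the remaining residual."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected code vectors; passing fewer tokens gives a coarser result."""
    output = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(tokens, codebooks):
        output = [o + c for o, c in zip(output, codebook[index])]
    return output


# Hypothetical 2-D codebooks, each stage half as coarse as the previous one.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]

tokens = rvq_encode([0.8, 0.3], CODEBOOKS)
coarse = rvq_decode(tokens[:1], CODEBOOKS)  # first stage only
finer = rvq_decode(tokens[:2], CODEBOOKS)   # first two stages
print(tokens, coarse, finer)
```

Applied to every encoder frame, this yields one token sequence per stage, which is where the "parallel sequences of discrete tokens" come from.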
+
+ More details can be found at:
+ - Paper: [FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec](https://arxiv.org/abs/2309.07405)
+ - Codebase: [FunCodec](https://github.com/alibaba-damo-academy/FunCodec)
+
+ ## Intended uses & scenarios
+ ### Inference with FunCodec
+
+ You can extract codec tokens from waveforms and reconstruct waveforms from them with the FunCodec repository.
+
+ #### FunCodec installation
+ ```sh
+ # Install PyTorch (GPU build, version >= 1.12.0):
+ conda install pytorch==1.12.0
+ # For other versions, please refer to https://pytorch.org/get-started/locally
+
+ # Download the codebase:
+ git clone https://github.com/alibaba-damo-academy/FunCodec.git
+
+ # Install the FunCodec codebase:
+ cd FunCodec
+ pip install --editable ./
+ ```
+
+ #### Codec extraction
+ ```sh
+ # Enter the example directory
+ cd egs/LibriTTS/codec
+ # Specify the model name
+ model_name="audio_codec-encodec-en-libritts-16k-nq32ds320-pytorch"
+ # Download the model
+ git lfs install
+ git clone https://huggingface.co/alibaba-damo/${model_name}
+ mkdir -p exp
+ mv "${model_name}" "exp/${model_name}"
+ # Extract codecs for the utterances listed in "input_wav.scp"; the codecs are saved under "outputs/codecs"
+ bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
+   --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
+   --wav_scp input_wav.scp --out_dir outputs/codecs
+ # input_wav.scp has the following format:
+ # uttid1 path/to/file1.wav
+ # uttid2 path/to/file2.wav
+ # ...
+ ```
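The `input_wav.scp` format described in the comments above (one `uttid path` pair per line) can be generated with a few lines of Python. This is only a sketch: using the file name without its extension as the utterance id is an assumption, and `format_wav_scp` is a hypothetical helper, not part of FunCodec.

```python
from pathlib import PurePosixPath


def format_wav_scp(wav_paths):
    """Return wav.scp lines ("uttid path") for the given wav file paths.

    The utterance id is taken from the file stem, which is an assumption of
    this sketch; duplicate ids are rejected because each id should be unique.
    """
    lines = []
    seen = set()
    for path in sorted(wav_paths):
        uttid = PurePosixPath(path).stem
        if uttid in seen:
            raise ValueError(f"duplicate utterance id: {uttid}")
        seen.add(uttid)
        lines.append(f"{uttid} {path}")
    return lines


scp_lines = format_wav_scp(["data/spk1/utt_b.wav", "data/spk1/utt_a.wav"])
print("\n".join(scp_lines))
```

Writing the joined lines to `input_wav.scp` produces a file that can be passed to `--wav_scp` in stage 1.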
+
+ #### Reconstruct waveforms from codecs
+ ```shell
+ # Reconstruct waveforms into "outputs/recon_wavs"
+ bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
+   --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
+   --wav_scp outputs/codecs/codecs.txt --out_dir outputs/recon_wavs
+ # codecs.txt is the output of stage 1 and has the following format:
+ # uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]]
+ # uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]]
+ # ...
+ ```
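For downstream use of the tokens, a line of `codecs.txt` can be split into the utterance id and a nested list of integers. The sketch below assumes the real lines contain complete JSON-style lists (the `...` in the format example above are elisions) and does not assume any particular meaning for the three nesting levels, which the README does not state. `parse_codec_line` is a hypothetical helper name.

```python
import json


def parse_codec_line(line):
    """Split one codecs.txt line into (uttid, nested list of token ids)."""
    # The utterance id is everything before the first whitespace run.
    uttid, payload = line.strip().split(maxsplit=1)
    # Assumption: the payload is a valid JSON-style nested list of integers.
    codes = json.loads(payload)
    return uttid, codes


uttid, codes = parse_codec_line("uttid1 [[[1, 2, 3],[2, 3, 4]]]")
print(uttid, codes)
```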
+
+ ### Inference with Huggingface Transformers
+ Inference with the Huggingface Transformers package is under development.
+
+ ### Application scenarios
+ Running environment
+ - The model has so far only been tested on Linux x86_64; macOS and Windows are untested.
+
+ Intended usage scenarios
+ - This model is suitable for academic use
+ - Speech quantization, codec and tokenization for English utterances
+
+ ## Evaluation results
+
+ ### Training configuration
+ - Feature info: raw waveform input
+ - Train info: Adam, lr 3e-4, batch_size 32, 2 GPUs (Tesla V100), acc_grad 1, 300000 steps, speech_max_length 51200
+ - Loss info: L1, L2, discriminative loss
+ - Model info: SEANet, Conv, LSTM
+ - Train config: encodec_16k_n32_600k_step.yaml
+ - Model size: 15.14 M parameters
+
+ ### Experimental Results
+ Test set: LibriTTS test-clean. Metric: ViSQOL score at different token rates (tk/s = tokens per second).
+ | Test set | 50 tk/s | 100 tk/s | 200 tk/s | 400 tk/s |
+ |:--------:|:--------:|:--------:|:--------:|:--------:|
+ | LibriTTS | 3.43 | 3.86 | 4.12 | 4.29 |
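The token rates in the table can be related to bit rates using values from this commit's `config.yaml` (`sampling_rate: 16000`, `encoder_hop_length: 320`, `codebook_size: 1024`). The sketch below is a back-of-the-envelope calculation; reading the table columns as 1, 2, 4 and 8 active quantizers is an inference from those values, not something the README states.

```python
import math


def tokens_per_second(num_quantizers, sampling_rate=16000, hop_length=320):
    """Each quantizer emits one token per encoder frame."""
    frames_per_second = sampling_rate / hop_length
    return num_quantizers * frames_per_second


def bitrate_bps(num_quantizers, codebook_size=1024, sampling_rate=16000, hop_length=320):
    """Each token indexes one of codebook_size entries, i.e. log2(codebook_size) bits."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second(num_quantizers, sampling_rate, hop_length) * bits_per_token


for n in (1, 2, 4, 8, 32):
    print(n, tokens_per_second(n), bitrate_bps(n))
```

With all 32 quantizers the result is 16000 bits per second, which agrees with the `--bit_width 16000` value used in the extraction and reconstruction commands.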
+
+ ### Limitations and bias
+ - Not very robust to background noise and reverberation
+
+ ### BibTeX entry and citation info
+ ```BibTeX
+ @misc{du2023funcodec,
+ title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
+ author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
+ year={2023},
+ eprint={2309.07405},
+ archivePrefix={arXiv},
+ primaryClass={cs.SD}
+ }
+ ```
config.yaml ADDED
@@ -0,0 +1,188 @@
+ config: conf/encodec_lstm_16k_n32_600k_step_rmseg_use_power.yaml
+ print_config: false
+ log_level: INFO
+ dry_run: false
+ iterator_type: sequence
+ output_dir: exp/encodec_lstm_16k_n32_600k_step_rmseg_use_power_raw_en_libritts
+ ngpu: 1
+ seed: 0
+ num_workers: 8
+ num_att_plot: 0
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: true
+ unused_parameters: true
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 60
+ max_update: 9223372036854775807
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ - - valid
+   - generator_multi_spectral_recon_loss
+   - min
+ keep_nbest_models: 60
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 50
+ use_tensorboard: true
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: true
+ freeze_param: []
+ num_iters_per_epoch: 10000
+ batch_size: 32
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ drop_last: true
+ train_shape_file:
+ - exp/tokenizer_states_16k/train/speech_shape
+ valid_shape_file:
+ - exp/tokenizer_states_16k/dev/speech_shape
+ batch_type: unsorted
+ valid_batch_type: null
+ speech_length_min: -1
+ speech_length_max: -1
+ fold_length:
+ - 512
+ - 150
+ sort_in_batch: descending
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ dataset_type: small
+ dataset_conf: {}
+ train_data_file: null
+ valid_data_file: null
+ train_data_path_and_name_and_type:
+ - - dump/raw_16k/train/wav.scp.pai
+   - speech
+   - kaldi_ark
+ valid_data_path_and_name_and_type:
+ - - dump/raw_16k/dev/wav.scp.pai
+   - speech
+   - kaldi_ark
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ valid_max_cache_size: null
+ optim: adam
+ optim_conf:
+   lr: 0.0003
+   betas:
+   - 0.5
+   - 0.9
+ scheduler: null
+ scheduler_conf:
+   step_size: 8
+   gamma: 0.1
+ optim2: adam
+ optim2_conf:
+   lr: 0.0003
+   betas:
+   - 0.5
+   - 0.9
+ scheduler2: null
+ scheduler2_conf:
+   step_size: 8
+   gamma: 0.1
+ simple_ddp: false
+ num_worker_count: 1
+ generator_first: false
+ input_size: 1
+ cmvn_file: null
+ disc_grad_clip: -1
+ disc_grad_clip_type: 2.0
+ gen_train_interval: 1
+ disc_train_interval: 1
+ use_preprocessor: true
+ speech_volume_normalize: null
+ speech_rms_normalize: false
+ speech_max_length: 40000
+ sampling_rate: 16000
+ valid_max_length: 40000
+ frontend: null
+ frontend_conf: {}
+ normalize: null
+ normalize_conf: {}
+ encoder: encodec_seanet_encoder
+ encoder_conf:
+   norm: time_group_norm
+   causal: false
+ quantizer: costume_quantizer
+ quantizer_conf:
+   codebook_size: 1024
+   num_quantizers: 32
+   ema_decay: 0.99
+   kmeans_init: true
+   sampling_rate: 16000
+   quantize_dropout: true
+   rand_num_quant:
+   - 1
+   - 2
+   - 4
+   - 8
+   - 16
+   - 32
+   use_ddp: true
+   encoder_hop_length: 320
+ decoder: encodec_seanet_decoder
+ decoder_conf:
+   norm: time_group_norm
+   causal: false
+ model: encodec
+ model_conf:
+   odim: 128
+   multi_spectral_window_powers_of_two:
+   - 5
+   - 6
+   - 7
+   - 8
+   - 9
+   - 10
+   target_sample_hz: 16000
+   audio_normalize: true
+   segment_dur: null
+   overlap_ratio: null
+   use_power_spec_loss: true
+ discriminator: multiple_disc
+ discriminator_conf:
+   disc_conf_list:
+   - filters: 32
+     name: encodec_multi_scale_stft_discriminator
+ distributed: false
+ version: 0.2.0
example/example.wav ADDED
Binary file (161 kB)
fig/framework.png ADDED
model.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9cd0de900178ba5b7cc5f83977f445cc8e35d6ad84e5ed8a7f9ae744c842aeb3
+ size 95149521