	# FAcodec

	PyTorch implementation for training FAcodec, which was proposed in the paper [NaturalSpeech 3: Zero-Shot Speech Synthesis
	with Factorized Codec and Diffusion Models](https://arxiv.org/pdf/2403.03100).

	A dedicated repository for the FAcodec model can also be found [here](https://github.com/Plachtaa/FAcodec).

	This implementation makes key improvements to the training pipeline that eliminate the need for any form of annotation, including
	transcripts, phoneme alignments, and speaker labels. All you need is raw speech files.
	With the new training pipeline, it is possible to train the model on more languages and on more diverse timbre distributions.
	We release the code for training and inference, along with a checkpoint pretrained on 50k hours of speech data from over 1 million speakers.

	## Model storage
	We provide checkpoints pretrained on 50k hours of speech data.

	| Model type | Link |
	|------------|------|
	| FAcodec | [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-FAcodec-blue)](https://huggingface.co/Plachta/FAcodec) |

	## Demo
	Try our model on [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/Plachta/FAcodecV2)!

	## Training
	Prepare your data and put it under one folder; the internal file structure does not matter.
	Then, change the `dataset` in `./egs/codec/FAcodec/exp_custom_data.json` to the path of your data folder.
	Finally, run the following command:
	```bash
	sh ./egs/codec/FAcodec/train.sh
	```
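If you prefer to script the config change rather than edit the file by hand, the step above could look roughly like the sketch below. It assumes `dataset` is a top-level key in the JSON file, which this README does not confirm; check the actual layout of `exp_custom_data.json` first. The helper name `set_dataset_path` and the demo path `/data/my_speech` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def set_dataset_path(config_path, data_dir, key="dataset"):
    """Set one top-level key of a JSON config to data_dir and return the updated dict.

    Assumption: the key sits at the top level of the config. Adapt this if
    the real file nests it elsewhere.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if key not in config:
        raise KeyError(f"{key!r} is not a top-level key in {config_path}")
    config[key] = str(data_dir)
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


# Demo on a throwaway file so the real config is not touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_cfg = Path(tmp) / "exp_custom_data.json"
    demo_cfg.write_text(json.dumps({"dataset": "", "other": 1}), encoding="utf-8")
    updated = set_dataset_path(demo_cfg, "/data/my_speech")
    reloaded = json.loads(demo_cfg.read_text(encoding="utf-8"))
```

For real use, call `set_dataset_path("./egs/codec/FAcodec/exp_custom_data.json", "<your_data_folder>")` from the repository root before launching `train.sh`.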

	## Inference
	To reconstruct a speech file, run:
	```bash
	python ./bins/codec/inference.py --source <source_wav> --output_dir <output_dir> --checkpoint_path <checkpoint_path>
	```
	To use zero-shot voice conversion, run:
	```bash
	python ./bins/codec/inference.py --source <source_wav> --reference <reference_wav> --output_dir <output_dir> --checkpoint_path <checkpoint_path>
	```
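To run either mode over many files, it can help to build the command line programmatically. The sketch below only assembles the argument list from the flags shown above (`--source`, `--reference`, `--output_dir`, `--checkpoint_path`); the helper name `build_inference_command` and the file names are placeholders, not part of the repository.

```python
import shlex
import sys


def build_inference_command(source, output_dir, checkpoint_path,
                            reference=None,
                            script="./bins/codec/inference.py"):
    """Return the argv list for reconstruction, or voice conversion if reference is given."""
    cmd = [
        sys.executable, script,
        "--source", str(source),
        "--output_dir", str(output_dir),
        "--checkpoint_path", str(checkpoint_path),
    ]
    if reference is not None:
        # Supplying a reference wav switches to zero-shot voice conversion.
        cmd += ["--reference", str(reference)]
    return cmd


# Placeholder paths, for illustration only.
recon_cmd = build_inference_command("in.wav", "out", "ckpt.pth")
vc_cmd = build_inference_command("in.wav", "out", "ckpt.pth", reference="ref.wav")

# Show the arguments without the interpreter path.
print(shlex.join(recon_cmd[1:]))
print(shlex.join(vc_cmd[1:]))
```

To actually launch a command, pass the list to `subprocess.run(cmd, check=True)` from the repository root.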

	## Feature extraction
	When running `./bins/codec/inference.py`, inspect the value returned by the `FAcodecInference` class, a tuple `(quantized, codes)`:
	- `quantized` is the quantized representation of the input speech file.
	  - `quantized[0]` is the quantized representation of prosody.
	  - `quantized[1]` is the quantized representation of content.
	- `codes` is the discrete code representation of the input speech file.
	  - `codes[0]` is the discrete code representation of prosody.
	  - `codes[1]` is the discrete code representation of content.

	For the cleanest content representation, free of timbre, we suggest using `codes[1][:, 0, :]`, which is the first layer of the content codebooks.
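The indexing can be illustrated on dummy data. This sketch uses NumPy arrays as stand-ins for the model output; the `(batch, layers, frames)` layout is inferred from the `[:, 0, :]` index, and the layer counts, codebook size, and frame count are made-up values, not taken from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, frames = 2, 50

# Stand-ins for the real output. Layer counts (1 and 2) and the codebook
# size (1024) are illustrative assumptions.
prosody_codes = rng.integers(0, 1024, size=(batch, 1, frames))
content_codes = rng.integers(0, 1024, size=(batch, 2, frames))
codes = (prosody_codes, content_codes)


def first_content_layer(codes):
    """Select the first content codebook layer: codes[1][:, 0, :]."""
    return codes[1][:, 0, :]


content_tokens = first_content_layer(codes)
print(content_tokens.shape)  # (2, 50)
```

The same index expression applies unchanged to a PyTorch tensor with this layout.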