Japanese GSLM

This is an Japanese implementation of Generative Spoken Language Model to support textless NLP in Japanese.
Submitted to Acoustical Society of Japan, 2023 Spring.

How to use

PyTorch version >= 1.10.0
Python version >= 3.8

Install requirements

It is pre-required to install the fairseq library and all the requirements the library needs.

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

pip install librosa, unidecode, inflect

Re-synthesis of voice signal

speech2unit

The procedure for speech2unit is the same as the gslm example in fairseq.

You can convert the Japanese voice signal to discrete unit through this pre-trained quantization model. Route the downloaded model to KM_MODEL_PATH.

This file replaces the HuBERT Base + KM200 model provided by fariseq, so it is required to download HuBERT-Base model as a pretrained acoustic model.

TYPE='hubert'
CKPT_PATH=<path_of_pretrained_acoustic_model>
LAYER=6
KM_MODEL_PATH=<output_path_of_the_kmeans_model>
MANIFEST=<tab_separated_manifest_of_audio_files_to_quantize>
OUT_QUANTIZED_FILE=<output_quantized_audio_file_path>

python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type $TYPE \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".wav"

unit2speech

unit2speech model is modified Tacotron2 model that learns to synthesize speech from discrete speech units. You can convert the discrete unit to synthesized voice through this model. Also, it is required to download Waveglow checkpoint for Vocoder.

Conversion from unit to speech is available with unit2speech_ja.py from this repository. It is also required to use hparam.py for extended compatability.

TTS_MODEL_PATH=<unit2speech_model_file_path>
OUT_DIR=<dir_to_dump_synthesized_audio_files>
WAVEGLOW_PATH=<path_where_you_have_downloaded_waveglow_checkpoint>

python unit2speech_ja.py \
    --tts_model_path $TTS_MODEL_PATH \
    --out_audio_dir $OUT_DIR \
    --waveglow_path  $WAVEGLOW_PATH \

References

Lakhotia, Kushal et al. On Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021.
Ott, Myle et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, 2019.