|
--- |
|
language: |
|
- ca |
|
license: |
|
- apache-2.0 |
|
tags: |
|
- matcha-tts |
|
- acoustic modelling |
|
- speech |
|
- multispeaker |
|
pipeline_tag: text-to-speech |
|
datasets: |
|
- projecte-aina/festcat_trimmed_denoised |
|
- projecte-aina/openslr-slr69-ca-trimmed-denoised |
|
--- |
|
|
|
# Matcha-TTS Catalan Multispeaker |
|
|
|
## Table of Contents |
|
<details> |
|
<summary>Click to expand</summary> |
|
|
|
- [Model description](#model-description) |
|
- [Intended uses and limitations](#intended-uses-and-limitations) |
|
- [How to use](#how-to-use) |
|
- [Training details](#training-details) |
|
- [Evaluation](#evaluation) |
|
- [Citation](#citation) |
|
- [Additional information](#additional-information) |
|
|
|
</details> |
|
|
|
## Model Description |
|
|
|
**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS. |
|
The encoder combines a text encoder with a phoneme duration predictor, which together predict averaged acoustic features. |

The decoder is essentially a U-Net backbone inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), which is in turn based on the Transformer architecture. |

By replacing its 2D CNNs with 1D CNNs, Matcha-TTS greatly reduces memory consumption and speeds up synthesis. |
|
|
|
**Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM). |

This yields an ODE-based decoder that reaches high output quality in fewer synthesis steps than models trained with score matching. |
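As a toy illustration of why an ODE-based decoder can get away with few synthesis steps, the sketch below (not the model's actual vector field, which is a trained network) integrates a hand-written, straight-line velocity field with a handful of Euler steps:

```python
# Toy sketch: Euler integration of an ODE dx/dt = v(x, t) from t=0 to t=1.
# In Matcha-TTS the velocity field is a trained network; here it is the
# constant field of a straight (optimal-transport-style) path, so a few
# coarse steps already land on the target.

def euler_decode(x0, target, n_steps):
    x, dt = x0, 1.0 / n_steps
    for _ in range(n_steps):
        v = target - x0  # constant velocity along a straight path
        x = x + v * dt
    return x

print(euler_decode(x0=0.3, target=-6.5, n_steps=4))
```

When the learned trajectories are close to straight, as OT-CFM encourages, coarse step sizes introduce little error, which is what makes fast synthesis possible.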
|
|
|
## Intended Uses and Limitations |
|
|
|
This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language. |
|
It has been fine-tuned using a Catalan phonemizer; if the model is used for other languages, it may not produce intelligible samples after mapping |

its output into a speech waveform. |
|
|
|
Sample quality can vary depending on the speaker. |

This may be due to the model's sensitivity to speaker-specific frequencies, as well as to the quality of the recordings available for each speaker. |
|
|
|
## How to Use |
|
|
|
### Installation |
|
|
|
This model has been trained using the espeak-ng open-source text-to-speech software. |

The espeak-ng fork containing the Catalan phonemizer can be found [here](https://github.com/projecte-aina/espeak-ng). |
|
|
|
Create and activate a virtual environment: |

```bash |

python -m venv /path/to/venv |

source /path/to/venv/bin/activate |

``` |
|
|
|
For training and inference with Catalan Matcha-TTS, you need to compile the provided espeak-ng with the Catalan phonemizer: |
|
```bash |
|
git clone https://github.com/projecte-aina/espeak-ng.git |
|
|
|
export PYTHON=/path/to/env/<env_name>/bin/python |
|
cd /path/to/espeak-ng |
|
./autogen.sh |
|
./configure --prefix=/path/to/espeak-ng |
|
make |
|
make install |
|
|
|
pip cache purge |
|
pip install mecab-python3 |
|
pip install unidic-lite |
|
``` |
|
Install the repository: |
|
```bash |
|
pip install git+https://github.com/langtech-bsc/Matcha-TTS.git@dev-cat |

``` |
|
|
|
### For Inference |
|
|
|
#### PyTorch |
|
|
|
Speech inference can be run with **Catalan Matcha-TTS** by loading the model remotely from the Hugging Face Hub. |
|
|
|
#### ONNX |
|
|
|
We also release an ONNX version of the model. |
|
|
|
### For Training |
|
|
|
The full checkpoint is also released so that you can continue pretraining or fine-tune the model. |
|
|
|
## Training Details |
|
|
|
### Training data |
|
|
|
The model was trained on two **Catalan** speech datasets: |
|
|
|
| Dataset | Language | Hours | Num. Speakers | |
|
|---------------------|----------|---------|-----------------| |
|
| [Festcat](https://huggingface.co./datasets/projecte-aina/festcat_trimmed_denoised) | ca | 22 | 11 | |
|
| [OpenSLR69](https://huggingface.co./datasets/projecte-aina/openslr-slr69-ca-trimmed-denoised) | ca | 5 | 36 | |
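For quick reference, the table's totals can be tallied as follows (a trivial sketch; the hour counts are the rounded figures from the table):

```python
# Totals over the two training datasets listed above.
datasets = {
    "Festcat":   {"hours": 22, "speakers": 11},
    "OpenSLR69": {"hours": 5,  "speakers": 36},
}

total_hours = sum(d["hours"] for d in datasets.values())
total_speakers = sum(d["speakers"] for d in datasets.values())
print(total_hours, total_speakers)  # 27 47
```

The 47 total speakers are what the speaker-embedding layer is sized for in the training hyperparameters below.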
|
|
|
### Training procedure |
|
|
|
***Catalan Matcha-TTS*** was not trained from scratch. Instead, we finetuned the model from the English multispeaker checkpoint |
|
(trained with the [VCTK dataset](https://huggingface.co./datasets/vctk)) provided by the authors. |
|
The embedding layer was initialized with the number of Catalan speakers (47), and the original hyperparameters were kept. |
|
|
|
### Training Hyperparameters |
|
|
|
* batch size: 32 (x2 GPUs) |
|
* learning rate: 1e-4 |
|
* number of speakers: 47 |
|
* n_fft: 1024 |
|
* n_feats: 80 |
|
* sample_rate: 22050 |
|
* hop_length: 256 |
|
* win_length: 1024 |
|
* f_min: 0 |
|
* f_max: 8000 |
|
* data_statistics: |
|
  * mel_mean: -6.578195 |

  * mel_std: 2.538758 |
|
* number of samples: 13340 |
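These settings fix the shape of the mel-spectrograms the model predicts. A small sketch of the arithmetic, assuming the common centered-STFT convention (number of frames = 1 + samples // hop_length):

```python
# How the STFT hyperparameters above determine the mel-spectrogram shape,
# assuming a centered STFT so that frames = 1 + num_samples // hop_length.
sample_rate = 22050  # Hz
hop_length = 256     # samples between frames (~11.6 ms)
n_feats = 80         # mel channels per frame

seconds = 3.0
num_samples = int(seconds * sample_rate)    # 66150
num_frames = 1 + num_samples // hop_length  # 259
print((n_feats, num_frames))  # (80, 259)
```

So at 22.05 kHz the decoder produces roughly 86 mel frames per second of audio.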
|
|
|
## Evaluation |
|
|
|
Validation values obtained from TensorBoard at epoch 2399 |

(note that fine-tuning started at epoch 1864; earlier epochs were trained on the VCTK dataset): |
|
|
|
* val_dur_loss_epoch: 0.38 |
|
* val_prior_loss_epoch: 0.97 |
|
* val_diff_loss_epoch: 2.195 |
|
|
|
## Citation |
|
|
|
If this code contributes to your research, please cite the work: |
|
|
|
```bibtex |
|
@misc{mehta2024matchatts, |
|
title={Matcha-TTS: A fast TTS architecture with conditional flow matching}, |
|
author={Shivam Mehta and Ruibo Tu and Jonas Beskow and Éva Székely and Gustav Eje Henter}, |
|
year={2024}, |
|
eprint={2309.03199}, |
|
archivePrefix={arXiv}, |
|
primaryClass={eess.AS} |
|
} |
|
``` |
|
|
|
## Additional Information |
|
|
|
### Author |
|
The Language Technologies Unit from Barcelona Supercomputing Center. |
|
|
|
### Contact |
|
For further information, please send an email to <[email protected]>. |
|
|
|
### Copyright |
|
Copyright (c) 2023 by the Language Technologies Unit, Barcelona Supercomputing Center. |
|
|
|
### License |
|
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
### Funding |
|
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/). |
|
|