arxiv:2006.01463

An ASR Guided Speech Intelligibility Measure for TTS Model Selection

Published on Jun 2, 2020

Abstract

The perceptual quality of neural text-to-speech (TTS) is highly dependent on the choice of the model during training. Training-objective metrics used for this selection, such as the least mean squared error, do not always correlate with human perception. In this paper, we propose an objective metric based on the phone error rate (PER) to select the TTS model with the best speech intelligibility. The PER is computed between the input text to the TTS model and the text decoded from the synthesized speech using an automatic speech recognition (ASR) model, which is trained on the same data as the TTS model. With the help of subjective studies, we show that the TTS model chosen with the least PER on the validation split has significantly higher speech intelligibility than the model with the least training-objective loss. Finally, using the proposed PER and subjective evaluation, we show that the choice of the best TTS model depends on the genre of the target-domain text. All our experiments are conducted on a Hindi-language dataset. However, the proposed model selection method is language independent.
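The selection rule described in the abstract can be illustrated with a minimal sketch. It assumes the conventional edit-distance definition of PER (substitutions, deletions, and insertions divided by the number of reference phones), which the abstract does not spell out. It also assumes that the input text and the ASR output have already been converted to phone sequences. The function names, checkpoint labels, and phone strings are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(pairs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Corpus-level PER: total edits divided by total reference phones.

    Each pair is (reference phones from the input text,
    phones decoded by the ASR model from the synthesized speech).
    """
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_ref = sum(len(ref) for ref, _ in pairs)
    if total_ref == 0:
        raise ValueError("reference phone sequences are empty")
    return total_edits / total_ref


def select_checkpoint(per_by_checkpoint: Dict[str, float]) -> str:
    """Return the checkpoint with the lowest validation PER."""
    return min(per_by_checkpoint, key=per_by_checkpoint.get)


# Hypothetical validation data: one utterance, two TTS checkpoints.
reference = ["n", "a", "m", "a", "s", "t", "e"]
decoded = {
    "ckpt_a": ["n", "a", "m", "s", "t", "e"],       # one phone dropped
    "ckpt_b": ["n", "a", "m", "a", "s", "t", "e"],  # exact match
}
per = {name: phone_error_rate([(reference, hyp)]) for name, hyp in decoded.items()}
best = select_checkpoint(per)
```

In this toy example `ckpt_b` is selected because its decoded phones match the reference exactly. In practice the PER would be aggregated over the full validation split rather than a single utterance.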
