Multi-Speaker Text-to-Speech Using ForwardTacotron with Improved Duration Prediction

Conference: Speech Communication - 15th ITG Conference
20.09.2023-22.09.2023 in Aachen

DOI: 10.30420/456164036

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Kayyar Lakshminarayana, Kishor; Dittmar, Christian; Pia, Nicola (Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany)
Habets, Emanuel A.P. (International Audio Laboratories Erlangen, Erlangen, Germany)

Abstract:
Several non-autoregressive methods for fast and efficient text-to-speech synthesis have been proposed. Most of these use a duration predictor to estimate the temporal sequence of phonemes in the speech. This duration prediction is based on the input phoneme sequence in a speaker-independent fashion. The resulting constant speech pace across speakers is unnatural, since every human has a unique, characteristic speaking rate. This paper proposes an extension of the multi-speaker ForwardTacotron that learns this aspect through trainable speaker embeddings. The durations of speech synthesized by the proposed model across multiple speakers are much closer to those of speech synthesized by a baseline autoregressive model. The proposed extension yields marginal improvements in intelligibility, as measured through an automated semantically unpredictable sentence test. Furthermore, a listening test shows that speech rhythm does not play a significant role in perceptual quality assessment.
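The core idea of the abstract, conditioning the duration predictor on a trainable per-speaker embedding so that each voice can learn its own pace, can be sketched as follows. This is a hypothetical illustration, not the authors' code: the layer sizes, the single linear projection, and the softplus output are assumptions standing in for the actual ForwardTacotron duration predictor.

```python
import numpy as np

# Hypothetical sketch: a duration predictor that concatenates a trainable
# speaker embedding to each phoneme encoding, so predicted durations can
# vary per speaker. All dimensions and the linear model are illustrative.
rng = np.random.default_rng(0)

NUM_SPEAKERS, EMB_DIM, PHON_DIM = 4, 8, 16

# Trainable parameters (randomly initialised here for demonstration).
speaker_embeddings = rng.normal(size=(NUM_SPEAKERS, EMB_DIM))
W = rng.normal(size=(PHON_DIM + EMB_DIM, 1)) * 0.1
b = np.zeros(1)

def predict_durations(phoneme_encodings, speaker_id):
    """Predict one duration (in frames) per phoneme.

    phoneme_encodings: array of shape (T, PHON_DIM), encoder outputs
    speaker_id: index into the speaker embedding table
    """
    T = phoneme_encodings.shape[0]
    # Broadcast the speaker embedding across all T phoneme positions.
    spk = np.broadcast_to(speaker_embeddings[speaker_id], (T, EMB_DIM))
    x = np.concatenate([phoneme_encodings, spk], axis=-1)
    # Softplus keeps the predicted durations strictly positive.
    return np.log1p(np.exp(x @ W + b)).squeeze(-1)

phonemes = rng.normal(size=(5, PHON_DIM))
d_speaker0 = predict_durations(phonemes, speaker_id=0)
d_speaker1 = predict_durations(phonemes, speaker_id=1)
```

Because the speaker embedding enters the duration predictor directly, the same phoneme sequence yields different predicted durations for different speakers, which is the speaker-dependent pacing the paper targets.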