A Lightweight Neural TTS System for High-quality German Speech Synthesis

Conference: Speech Communication - 14th ITG Conference
29.09.2021 - 01.10.2021, online

Proceedings: ITG-Fb. 298: Speech Communication

Pages: 5
Language: English
Type: PDF

Govalkar, Prachi; Mustafa, Ahmed; Pia, Nicola; Bauer, Judith; Yurt, Metehan; Dittmar, Christian (Fraunhofer IIS, Erlangen, Germany)
Oezer, Yigitcan (International Audio Laboratories Erlangen, Germany)

This paper describes a lightweight neural text-to-speech (TTS) system for the German language. The system consists of a non-autoregressive spectrogram predictor followed by StyleMelGAN, a recently proposed neural vocoder. The complete system has a footprint of only 61 MB and synthesizes high-quality speech faster than real time on both CPU (2.55x) and GPU (50.29x). We additionally propose a modified version of the vocoder, Multi-band StyleMelGAN, which offers a significant improvement in inference speed at the cost of a small tradeoff in speech quality. In a perceptual listening test with the complete TTS pipeline, the best configuration, which uses StyleMelGAN, achieves a mean opinion score of 3.84, compared to 4.23 for professional speech recordings.
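
To make the reported speed figures concrete, the following minimal sketch converts a faster-than-real-time factor into wall-clock synthesis time. It assumes the factor is defined as audio duration divided by synthesis time; the abstract does not state the definition, and the function names and the 10-second utterance are hypothetical, chosen only for illustration.

```python
def realtime_speedup(audio_seconds: float, synthesis_seconds: float) -> float:
    """Return how many times faster than real time the synthesis ran.

    Assumed definition: audio duration divided by wall-clock synthesis time.
    """
    if audio_seconds <= 0 or synthesis_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / synthesis_seconds


def synthesis_time(audio_seconds: float, speedup: float) -> float:
    """Return the wall-clock time needed to synthesize the given audio duration."""
    if audio_seconds <= 0 or speedup <= 0:
        raise ValueError("duration and speedup must be positive")
    return audio_seconds / speedup


# Hypothetical 10-second utterance with the factors quoted in the abstract.
utterance_seconds = 10.0
cpu_time = synthesis_time(utterance_seconds, 2.55)   # CPU factor from the abstract
gpu_time = synthesis_time(utterance_seconds, 50.29)  # GPU factor from the abstract

print(f"CPU: {cpu_time:.2f} s, GPU: {gpu_time:.2f} s")
```

Under this assumed definition, a 10-second utterance would take roughly 3.9 seconds on CPU and about 0.2 seconds on GPU.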