A Lightweight Neural TTS System for High-quality German Speech Synthesis

Conference: Speech Communication - 14th ITG Conference
09/29/2021 - 10/01/2021, held online

Proceedings: ITG-Fb. 298: Speech Communication

Pages: 5
Language: English
Type: PDF

Govalkar, Prachi; Mustafa, Ahmed; Pia, Nicola; Bauer, Judith; Yurt, Metehan; Dittmar, Christian (Fraunhofer IIS, Erlangen, Germany)
Oezer, Yigitcan (International Audio Laboratories Erlangen, Germany)

This paper describes a lightweight neural text-to-speech (TTS) system for the German language. The system consists of a non-autoregressive spectrogram predictor followed by a recently proposed neural vocoder called StyleMelGAN. The complete system has a small footprint of 61 MB and synthesizes high-quality speech faster than real time on both CPU (2.55x) and GPU (50.29x). We additionally propose a modified version of the vocoder, called Multi-band StyleMelGAN, which significantly improves inference speed at the cost of a small reduction in speech quality. In a perceptual listening test with the complete TTS pipeline, the best configuration, which uses StyleMelGAN, achieves a mean opinion score of 3.84, compared to 4.23 for professional speech recordings.
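
To make the quoted speed figures concrete, the following minimal sketch converts a speed-up factor into the wall-clock time needed to synthesize a given amount of speech. It assumes that "2.55x" and "50.29x" mean the ratio of audio duration to synthesis time, which the abstract does not state explicitly. The function name `synthesis_time` and the 10-second example duration are hypothetical and chosen only for illustration.

```python
def synthesis_time(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of speech.

    Assumption: `speedup` is audio duration divided by synthesis time,
    so a value above 1.0 means faster than real time.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Speed-up factors quoted in the abstract.
SPEEDUPS = {"CPU": 2.55, "GPU": 50.29}

if __name__ == "__main__":
    for device, factor in SPEEDUPS.items():
        seconds = synthesis_time(10.0, factor)
        # The real-time factor (RTF) is the reciprocal of the speed-up.
        print(f"{device}: 10 s of speech in about {seconds:.2f} s "
              f"(RTF {1.0 / factor:.3f})")
```

Under this reading, 10 seconds of speech would take roughly 3.9 seconds on CPU and about 0.2 seconds on GPU.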