Neural Prosody Prediction for German Articulatory Speech Synthesis
Conference: Speech Communication - 16th ITG Conference
09/24/2025 - 09/26/2025 in Berlin, Germany
Proceedings: ITG-Fb. 321: Speech Communication
Pages: 5
Language: English
Type: PDF
Authors:
Steiner, Peter; Huang, Zihao; Fietkau, Arne-Lukas; Birkholz, Peter
Abstract:
This paper presents a prosody prediction model for German articulatory speech synthesis. Its inputs are phoneme sequences augmented with linguistic attributes, i.e., word boundaries, phrase boundaries, and accent information. Its outputs are the duration of each phoneme and three f0 values at 20%, 50%, and 80% of each phoneme's time interval. The neural model was a bidirectional recurrent neural network trained on the 1683 sentences of the BITS-US corpus from a male German speaker. The focus of this study was to investigate the effect of four different phoneme embedding techniques on model performance. The results show that the model using pretrained embeddings, fine-tuned for the prosody prediction task, was superior in predicting both phoneme duration and f0. These results were confirmed by a listening experiment in which subjects were asked to compare sentences synthesized by the VocalTractLab speech synthesizer with the prosody variants predicted using the different embeddings.
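To illustrate the architecture outlined in the abstract (a phoneme embedding concatenated with linguistic attributes, a bidirectional recurrent layer, and per-phoneme outputs for duration and three f0 values), the following is a minimal sketch assuming a PyTorch implementation. All class names, layer choices (e.g., GRU), and hyperparameters are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of the prosody predictor described in the abstract:
# phoneme IDs + linguistic attributes in, per-phoneme duration and three f0 values out.
# Names, layer types, and sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    def __init__(self, num_phonemes, num_attributes, embed_dim=64, hidden_dim=128):
        super().__init__()
        # Phoneme embedding; could be randomly initialized or loaded from
        # pretrained vectors and fine-tuned, as compared in the paper.
        self.embedding = nn.Embedding(num_phonemes, embed_dim)
        self.rnn = nn.GRU(
            embed_dim + num_attributes,  # embedding concatenated with attribute features
            hidden_dim,
            batch_first=True,
            bidirectional=True,
        )
        self.duration_head = nn.Linear(2 * hidden_dim, 1)  # phoneme duration
        self.f0_head = nn.Linear(2 * hidden_dim, 3)        # f0 at 20%, 50%, 80% of the phoneme

    def forward(self, phoneme_ids, attributes):
        # phoneme_ids: (batch, seq_len); attributes: (batch, seq_len, num_attributes)
        x = torch.cat([self.embedding(phoneme_ids), attributes], dim=-1)
        h, _ = self.rnn(x)
        return self.duration_head(h), self.f0_head(h)

# Example usage with dummy data
model = ProsodyPredictor(num_phonemes=50, num_attributes=4)
ids = torch.randint(0, 50, (2, 10))
attrs = torch.rand(2, 10, 4)
durations, f0 = model(ids, attrs)  # shapes: (2, 10, 1) and (2, 10, 3)
```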

