Improving the Naturalness of Synthesized Spectrograms for TTS Using GAN-Based Post-Processing

Conference: Speech Communication - 15th ITG Conference
20.09.2023-22.09.2023 in Aachen

doi:10.30420/456164053

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5 | Language: English | Type: PDF

Authors:
Sani, Paolo; Bauer, Judith; Zalkow, Frank; Dittmar, Christian (Fraunhofer IIS, Erlangen, Germany)
Habets, Emanuel A. P. (Fraunhofer IIS, Erlangen, Germany & International Audio Laboratories Erlangen, Germany)

Abstract:
Recent text-to-speech (TTS) architectures typically synthesize speech in two stages. First, an acoustic model predicts a compressed spectrogram from the text input. Second, a neural vocoder converts the spectrogram into a time-domain audio signal. However, synthesized spectrograms often differ substantially from real-world spectrograms; in particular, they lack fine-grained detail, a phenomenon referred to as the “over-smoothing effect.” Consequently, the audio signals generated by the vocoder may contain audible artifacts. We propose a spectrogram post-processing model based on generative adversarial networks (GANs) to improve the naturalness of synthesized spectrograms. In our experiments, we use acoustic models of varying quality (yielding different degrees of artifacts) and conduct listening tests, which show that our approach can substantially improve the naturalness of synthesized spectrograms. The improvement is especially pronounced for highly degraded spectrograms that lack fine-grained detail or harmonic content.
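To make the idea concrete, the following is a minimal, illustrative PyTorch sketch of GAN-based spectrogram post-processing — not the authors' actual architecture. It assumes a residual convolutional post-net (`PostNet`) that refines an over-smoothed mel-spectrogram, paired with a patch discriminator (`PatchDiscriminator`) that scores local time-frequency patches as real or synthesized; all layer names, sizes, and hyperparameters are assumptions for illustration only.

```python
# Illustrative sketch of GAN-based spectrogram post-processing.
# Not the paper's architecture: all module names and layer sizes are assumptions.
import torch
import torch.nn as nn


class PostNet(nn.Module):
    """Generator: predicts a residual correction for an over-smoothed mel-spectrogram."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, frames); the residual connection preserves
        # the coarse structure while the conv stack adds fine-grained detail.
        return mel + self.net(mel)


class PatchDiscriminator(nn.Module):
    """Discriminator: one real/fake logit per local spectrogram patch."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.net(mel)


if __name__ == "__main__":
    generator, discriminator = PostNet(), PatchDiscriminator()
    synth = torch.randn(2, 1, 80, 100)      # stand-in for over-smoothed spectrograms
    refined = generator(synth)              # same shape as the input
    patch_logits = discriminator(refined)   # downsampled grid of patch scores
    print(refined.shape, patch_logits.shape)
```

In training, the generator would minimize an adversarial loss against the patch logits (often combined with a spectrogram reconstruction loss), pushing refined spectrograms toward the distribution of real ones.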