A Very-Low Delay High-Performance Speech Vocoder Based on the Encodec Speech Decoder
Conference: Speech Communication - 16th ITG Conference
24.09.2025-26.09.2025 in Berlin, Germany
Proceedings: ITG-Fb. 321: Speech Communication
Pages: 5
Language: English
Type: PDF
Authors:
Shi, Renzheng; Fingscheidt, Tim
Abstract:
Neural vocoders have demonstrated superior synthesized speech quality. However, their sequence-to-sequence synthesis prohibits their use in low-latency conversational applications. Introducing causal convolutions for low-delay synthesis often results in noticeable quality degradation. In this work, we propose a high-performance low-delay vocoder. First, we tailor the decoder of the advanced speech codec Encodec to a speech vocoder conditioned on Mel spectrogram input. Second, we investigate several topological changes to enhance the synthesized speech. Third, we leverage the large-scale training procedure from BigVGAN. In a speaker-independent wideband speech setup, our proposed low-delay vocoder achieves a subjective mean opinion score (MOS, per ITU-T P.808) of 4.05 and outperforms all investigated baselines in all quality metrics, while being computationally efficient and offering an algorithmic delay of only 20 ms instead of sequence-to-sequence processing. Accordingly, our vocoder marks a new state of the art in its class.
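
The abstract contrasts sequence-to-sequence synthesis with causal convolutions. The following is a minimal sketch of what "causal" means for a 1-D convolution, and not the authors' architecture: the function name `causal_conv1d` and the kernel values are illustrative assumptions. Each output sample depends only on the current and past input samples, so a prefix of the input already yields the final outputs for that prefix.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: y[t] = sum_j kernel[j] * x[t - j].

    Illustrative sketch only (hypothetical helper, not taken from the paper).
    """
    k = len(kernel)
    # Left-pad with k-1 zeros so that no future sample is ever needed.
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
        for t in range(len(x))
    ]


x = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.5]  # assumed two-tap averaging filter

full = causal_conv1d(x, kernel)
prefix = causal_conv1d(x[:2], kernel)

# Outputs for the first two samples do not change when later samples arrive.
print(full[:2] == prefix)
```

A non-causal (centered) convolution would instead need samples to the right of position t, which is what forces a model to wait for future input and adds delay.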

