Towards Complex-Valued VAE-Based Distillation for Representation Learning in Speech Enhancement

Conference: Speech Communication - 16th ITG Conference
24.09.2025-26.09.2025 in Berlin, Germany

Proceedings: ITG-Fb. 321: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Zhao, Haixin; Yang, Kaixuan; Madhu, Nilesh

Abstract:
We propose a Complex-valued Variational AutoEncoder (CVAE)-based distillation framework that improves speech-noise disentanglement in representation-learning models for speech enhancement. The teacher model, built upon computationally intensive yet high-performance stacked transformer architectures, incorporates a cascaded predictive sub-network and a CVAE. Through this design, we aim to better capture the speech-noise interaction in the latent space, thus enabling the generation of fine-grained prior representations. A closed-form expression for the general complex-valued Kullback-Leibler (KL) divergence, parametrised by mean, variance, and pseudo-covariance, is derived to guide the distillation process in the latent space. Employing an additional symmetric KL divergence loss, the student model, a lightweight, causal Distilled CVAE (D-CVAE), outperforms the baseline CVAE across all instrumental metrics. It also achieves enhancement performance comparable to that of a state-of-the-art lightweight predictive model employing stacked frequency-time-frequency (FTF) transformers. At the same time, it maintains a well-regularised latent space, highlighting its effectiveness in both representation learning and speech-noise modelling.
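For orientation, one standard closed form for the KL divergence between two general (possibly improper) scalar complex Gaussians, each parametrised by mean \mu, variance \sigma^2, and pseudo-covariance \delta, can be written via the augmented covariance matrix. The sketch below is an illustrative textbook form under this parametrisation, not necessarily the exact expression derived in the paper.

% Augmented covariance of a scalar complex Gaussian CN(\mu, \sigma^2, \delta):
%   R = [ \sigma^2  \delta ; \delta^*  \sigma^2 ],  with  det R = \sigma^4 - |\delta|^2.
% KL divergence between q = CN(\mu_q, \sigma_q^2, \delta_q) and p = CN(\mu_p, \sigma_p^2, \delta_p):
\[
  D_{\mathrm{KL}}\!\left(q \,\middle\|\, p\right)
  = \frac{1}{2}\left[
      \operatorname{tr}\!\left(R_p^{-1} R_q\right)
      + \underline{\Delta\mu}^{\mathsf{H}} R_p^{-1}\, \underline{\Delta\mu}
      - 2
      + \ln \frac{\det R_p}{\det R_q}
    \right],
  \qquad
  \underline{\Delta\mu} =
  \begin{bmatrix}
    \mu_q - \mu_p \\[2pt]
    (\mu_q - \mu_p)^{*}
  \end{bmatrix}.
\]

Setting \delta_q = \delta_p = 0 recovers the familiar KL divergence between proper complex Gaussians, and a symmetric loss of the kind mentioned in the abstract can be formed as D_{\mathrm{KL}}(q\,\|\,p) + D_{\mathrm{KL}}(p\,\|\,q).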