Target Speaker Extraction: the Importance of a Powerful Extractor and Content-Informed Embeddings
Conference: Speech Communication - 16th ITG Conference
09/24/2025 - 09/26/2025 at Berlin, Germany
Proceedings: ITG-Fb. 321: Speech Communication
Pages: 5
Language: English
Type: PDF
Authors:
De Souter, Elias; Kindt, Stijn; Yang, Kaixuan; Zhao, Haixin; Song, Siyuan; Song, Yanjue; Madhu, Nilesh
Abstract:
In single-channel Target Speaker Extraction (TSE), the objective is to isolate a desired speaker from a mixture containing interfering speakers and background noise using a single microphone. This paper adapts the Convolutional Recurrent U-net for Speech Enhancement (CRUSE) architecture for TSE by integrating speaker embeddings derived from the ECAPA-TDNN model at the network bottleneck. We investigate three embedding integration strategies: (1) direct concatenation with bottleneck features, (2) Feature-wise Linear Modulation (FiLM), and (3) attention-based fusion. Additionally, we compare the effectiveness of content-aware embeddings, derived from the current mixture, against static global embeddings. Our experiments reveal that content-aware embeddings significantly improve extraction quality in complex acoustic conditions. Moreover, we find that allocating greater computational capacity to the bottleneck extractor is more beneficial than increasing fusion complexity. Notably, simple concatenation combined with a stronger bottleneck outperforms more complex fusion strategies such as FiLM and attention, despite similar or lower overall model complexity.
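
The following is a minimal, self-contained sketch of the three embedding-integration strategies named in the abstract (concatenation, FiLM, and attention-based fusion), written in PyTorch. It is not the authors' implementation: the class names, feature dimensions, and the residual cross-attention variant are illustrative assumptions for how a speaker embedding could be fused with bottleneck features.

```python
# Hypothetical sketch of the three fusion strategies; dimensions are illustrative,
# not taken from the paper (e.g. 192 is only assumed for the ECAPA embedding size).
import torch
import torch.nn as nn


class ConcatFusion(nn.Module):
    """(1) Concatenate the speaker embedding with every bottleneck time frame."""
    def __init__(self, feat_dim: int, emb_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim + emb_dim, feat_dim)

    def forward(self, feats: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim), emb: (batch, emb_dim)
        emb_rep = emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        return self.proj(torch.cat([feats, emb_rep], dim=-1))


class FiLMFusion(nn.Module):
    """(2) Feature-wise Linear Modulation: embedding predicts per-channel scale and shift."""
    def __init__(self, feat_dim: int, emb_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(emb_dim, feat_dim)
        self.to_beta = nn.Linear(emb_dim, feat_dim)

    def forward(self, feats: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        gamma = self.to_gamma(emb).unsqueeze(1)   # (batch, 1, feat_dim)
        beta = self.to_beta(emb).unsqueeze(1)
        return gamma * feats + beta


class AttentionFusion(nn.Module):
    """(3) Attention-based fusion: bottleneck frames attend to the speaker embedding."""
    def __init__(self, feat_dim: int, emb_dim: int, num_heads: int = 4):
        super().__init__()
        self.emb_proj = nn.Linear(emb_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        kv = self.emb_proj(emb).unsqueeze(1)      # (batch, 1, feat_dim)
        out, _ = self.attn(query=feats, key=kv, value=kv)
        return feats + out                        # residual connection


if __name__ == "__main__":
    B, T, F, E = 2, 50, 256, 192  # batch, frames, bottleneck dim, assumed embedding dim
    feats, emb = torch.randn(B, T, F), torch.randn(B, E)
    for fusion in (ConcatFusion(F, E), FiLMFusion(F, E), AttentionFusion(F, E)):
        print(type(fusion).__name__, fusion(feats, emb).shape)  # all (B, T, F)
```

All three modules map the bottleneck features to a tensor of the same shape, so under this assumption they are interchangeable at the network bottleneck, which is what allows the paper's comparison of fusion complexity versus extractor capacity.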

