An Improved Neural Network Architecture for Target Speech Extraction

Conference: Speech Communication - 16th ITG Conference
24-26 September 2025 in Berlin, Germany

Proceedings: ITG-Fb. 321: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Joos, David; Faubel, Friedrich; Jungclaussen, Jonas; Buck, Markus; Minker, Wolfgang

Abstract:
In this work, we present an improved architecture for neural-network-based target speech extraction. The main idea is a dedicated path in which the target speaker embedding is adapted to the input signal. The rationale behind this approach is to translate the long-term characteristics of a pre-trained voice print into short-term information that can be exploited at the frame level. We show that this architecture outperforms the current state of the art, which applies a static transformation to the embedding independently of the input signal. Alongside exhaustive ablation studies that corroborate the proposed design, we provide a comparative evaluation with the SpeakerBeam approach. The separation performance is evaluated by means of objective speech quality metrics such as SI-SDR, PESQ and STOI. We also report voice recognition results.
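
Of the metrics named in the abstract, SI-SDR is simple enough to illustrate directly. The following is a minimal sketch of the commonly used scale-invariant SDR definition, not the authors' evaluation code. The function name `si_sdr`, the `eps` stabiliser and the mean removal step are assumptions made for this illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length signals.

    Illustrative helper (hypothetical name); `eps` only guards against
    division by zero and log of zero.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Remove the mean so that a DC offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference: the optimally scaled target.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r)
    alpha = dot / (ref_energy + eps)
    target = [alpha * b for b in r]

    # Everything in the estimate that is not explained by the target.
    residual = [a - t for a, t in zip(e, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


if __name__ == "__main__":
    reference = [1.0, -1.0, 1.0, -1.0]
    # Distortion chosen to be orthogonal to the reference.
    distortion = [1.0, 1.0, -1.0, -1.0]
    estimate = [r + 0.1 * d for r, d in zip(reference, distortion)]
    print(round(si_sdr(estimate, reference), 3))
```

Because the estimate is projected onto the reference before the ratio is formed, rescaling the estimate leaves the score unchanged, which is what distinguishes SI-SDR from plain SDR.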