Personalized Speech Synthesis for Zero-Shot Keyword Spotting

Conference: Speech Communication - 16th ITG Conference
24.09.2025-26.09.2025 in Berlin, Germany

Proceedings: ITG-Fb. 321: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Goekgoez, Fahrettin; Cornaggia-Urrigshardt, Alessia; Wilkinghoff, Kevin

Abstract:
Usually, keyword spotting (KWS) systems can only detect the specific keywords they were trained to detect. Moreover, a sufficiently large number of spoken samples needs to be provided for each keyword, which may be impractical, particularly in dynamic applications. In this work, we present a methodology for generating synthetic speech samples to enhance KWS models, specifically to adapt to unseen words associated with known speakers. We first refine the SoundStream neural codec to achieve high-quality encoding and decoding of the target speaker’s voice. Subsequently, we adapt the SpearTTS model to create phonetically diverse sentences through a use-case generator module. The generated sentences are then strongly labeled to capture individual words. In our experiments, we trained a template-based KWS model on this synthetic dataset and evaluated its performance on a set of real-world data. Our findings demonstrate the efficacy of synthetic data in improving KWS adaptability to new vocabularies.
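The abstract does not spell out how the template-based KWS model scores a match. A common approach for template-based spotting is sliding-window dynamic time warping (DTW) between a keyword template and the incoming feature stream; the sketch below illustrates this generic idea on toy one-dimensional features (real systems would use speech embeddings or spectral frames), and all names here are illustrative, not from the paper.

```python
def dtw_distance(a, b):
    """Length-normalized DTW distance between two 1-D feature sequences,
    using absolute difference as the per-frame cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of best alignment of a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m] / (n + m)


def detect_keyword(template, stream, threshold=0.5):
    """Slide a template-length window over the stream and report
    (start index, distance) for windows below the threshold."""
    hits = []
    w = len(template)
    for start in range(len(stream) - w + 1):
        d = dtw_distance(template, stream[start:start + w])
        if d < threshold:
            hits.append((start, d))
    return hits
```

In this framing, the paper's synthetic speech would supply the keyword templates: a new word is synthesized in the known speaker's voice, its strongly labeled segment becomes the template, and detection then needs no real recordings of that word.

```python
# toy usage: the template embedded at position 3 of a silent stream
template = [1.0, 2.0, 3.0, 2.0, 1.0]
stream = [0.0] * 3 + template + [0.0] * 3
hits = detect_keyword(template, stream, threshold=0.2)
```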