Two-Dimensional Embeddings for Low-Resource Keyword Spotting Based on Dynamic Time Warping

Conference: Speech Communication - 14th ITG Conference
09/29/2021 - 10/01/2021 at online

Proceedings: ITG-Fb. 298: Speech Communication

Pages: 5Language: englishTyp: PDF

Personal VDE Members are entitled to a 10% discount on this title

Wilkinghoff, Kevin; Cornaggia-Urrigshardt, Alessia; Goekgoez, Fahrettin (Fraunhofer Institute for Communication, Information Processing and Ergonomics FKIE, Wachtberg, Germany)

State-of-the-art keyword spotting systems consist of neural networks trained as classifiers or trained to extract discriminative representations, so-called embeddings. However, a sufficient amount of labeled data is needed to train such a system. Dynamic time warping is another keyword spotting approach that uses only a single sample of each keyword as patterns to be searched and thus does not require any training. In this work, we propose to combine the strengths of both keyword spotting approaches in two ways: First, an angular margin loss for training a neural network to extract two-dimensional embeddings is presented. It is shown that these embeddings can be used as features for dynamic time warping, outperforming cepstral features even when very few training samples are available. Second, dynamic time warping is applied to cepstral features to turn weak into strong labels and thus provide more labeled training data for the two-dimensional embeddings.