Two-Dimensional Embeddings for Low-Resource Keyword Spotting Based on Dynamic Time Warping

Konferenz: Speech Communication - 14th ITG Conference
29.09.2021 - 01.10.2021 in online

Tagungsband: ITG-Fb. 298: Speech Communication

Seiten: 5Sprache: EnglischTyp: PDF

Persönliche VDE-Mitglieder erhalten auf diesen Artikel 10% Rabatt

Wilkinghoff, Kevin; Cornaggia-Urrigshardt, Alessia; Goekgoez, Fahrettin (Fraunhofer Institute for Communication, Information Processing and Ergonomics FKIE, Wachtberg, Germany)

State-of-the-art keyword spotting systems consist of neural networks trained as classifiers or trained to extract discriminative representations, so-called embeddings. However, a sufficient amount of labeled data is needed to train such a system. Dynamic time warping is another keyword spotting approach that uses only a single sample of each keyword as patterns to be searched and thus does not require any training. In this work, we propose to combine the strengths of both keyword spotting approaches in two ways: First, an angular margin loss for training a neural network to extract two-dimensional embeddings is presented. It is shown that these embeddings can be used as features for dynamic time warping, outperforming cepstral features even when very few training samples are available. Second, dynamic time warping is applied to cepstral features to turn weak into strong labels and thus provide more labeled training data for the two-dimensional embeddings.