Ultra-low power acoustic front-ends for natural language user interfaces

Conference: VDE-Kongress 2016 - Internet der Dinge
11/07/2016 - 11/08/2016 in Mannheim, Germany

Proceedings: VDE-Kongress 2016 – Internet der Dinge

Pages: 5
Language: English
Type: PDF


Authors:
Fischer, Johannes (International Audio Laboratories Erlangen, a joint institution of the Friedrich-Alexander-Universität Erlangen, Germany)
Bhardwaj, Kanav; Breiling, Marco; Leyh, Martin; Bäckström, Tom (International Audio Laboratories Erlangen, Germany & Fraunhofer IIS, Am Wolfsmantel 33, 91058 Erlangen, Germany)

Abstract:
In the era of the internet of things (IoT), the number of smart devices with permanent access to the internet is expected to increase. As the range of functions of these devices has grown rapidly, controlling them with conventional human-machine interfaces (HMIs) can be cumbersome. Natural language user interfaces (NLUIs) offer a more natural form of interaction that does not require navigating menus. However, NLUIs are based on speech recognition frameworks, which are computationally complex and therefore impractical for small, battery-powered devices. A preprocessing stage such as voice activity detection (VAD) can therefore be used to power up such a framework only when speech is present, which saves power and prolongs battery life. We implemented a low-power VAD composed of two stages: the first evaluates features in the time domain, and the second complements them with frequency-domain features. Both stages apply a linear classification scheme. The proposed approach was evaluated under different conditions, with signals degraded by noise and reverberation. We show that it has low computational complexity while attaining error rates below 10% even under adverse conditions. Moreover, we implemented a simple keyword spotting (KS) algorithm based on mel-frequency cepstral coefficients (MFCCs), a linear classifier and a sequence detector. With this simple scheme, the achieved recognition rate was between 50% and 80% under non-reverberant conditions, though performance drops with increasing reverberation.
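The two-stage structure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the features, weights or frame length, so everything specific here is an assumption made for illustration: log energy and zero-crossing rate as time-domain features, a low-band energy ratio as the frequency-domain feature, the hypothetical parameter sets `STAGE1` and `STAGE2`, and the choice to run the second stage only when the first stage fires. It is not the authors' implementation.

```python
import cmath
import math


def time_features(frame):
    """Time-domain features (assumed): log energy and zero-crossing rate."""
    n = len(frame)
    energy = sum(x * x for x in frame) / n
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return [math.log10(energy + 1e-12), crossings / (n - 1)]


def spectral_features(frame):
    """Frequency-domain feature (assumed): share of energy in the lowest bins.

    Uses a naive DFT to stay dependency-free; a real front-end would use an FFT.
    """
    n = len(frame)
    power = []
    for k in range(1, n // 2):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        power.append(abs(s) ** 2)
    total = sum(power) + 1e-12
    low = sum(power[: len(power) // 4])
    return [low / total]


def linear_decision(features, weights, bias):
    """Linear classifier: positive weighted sum means 'speech'."""
    return sum(w * f for w, f in zip(weights, features)) + bias > 0.0


def two_stage_vad(frame, stage1, stage2):
    """Cheap time-domain check first; spectral check only if stage 1 fires."""
    t = time_features(frame)
    if not linear_decision(t, *stage1):
        return False  # the costlier spectral stage is skipped entirely
    return linear_decision(t + spectral_features(frame), *stage2)


# Hypothetical (weights, bias) pairs; in practice these would be trained.
STAGE1 = ([1.0, 0.0], 3.0)         # fires when log10(energy) > -3
STAGE2 = ([0.0, 0.0, 1.0], -0.5)   # fires when low-band energy share > 0.5
```

Calling `two_stage_vad(frame, STAGE1, STAGE2)` on a list of samples returns a boolean per frame. The point of the cascade in this sketch is that most non-speech frames are rejected using only the inexpensive time-domain features, so the spectral analysis runs rarely.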