Robust Multimodal Human Machine Interaction using the Kinect Sensor

Conference: Speech Communication - 11. ITG-Fachtagung Sprachkommunikation
09/24/2014 - 09/26/2014 at Erlangen, Germany

Proceedings: Speech Communication

Pages: 4
Language: English
Type: PDF


Authors:
Zeiler, Steffen; Cwiklak, Jan; Kolossa, Dorothea (Cognitive Signal Processing Group, Ruhr-Universitaet Bochum, 44801 Bochum, Germany)

Abstract:
Distant-talking automatic speech recognition is still not sufficiently robust for everyday use in noisy real-room environments. In the following, we consider audiovisual speech recognition as an alternative approach for the specific toy scenario of controlling computer chess by voice, even while listening to music or newscasts. Firstly, we describe a system design that is suitable for real-time speech input, modeling audiovisual speech in a flexible, simple way that allows for some asynchrony, as is also observed in real-time data from audiovisual sensor systems such as the Kinect. Secondly, we focus on the selection of the video features and on their dimensionality reduction, comparing and fusing three feature types: a DCT of the mouth region, facial action unit features, and 3D locations of facial landmarks. Finally, the utility of the presented system is tested in highly noisy environments, with speech-to-noise ratios around zero dB, and the performance of two audiovisual feature sets is compared on this task.
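
To make the DCT-based video features more concrete, the following Python sketch shows one way such features could be extracted from a mouth-region crop and then reduced in dimensionality. The patch size, the number of retained coefficients, and the use of PCA are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.decomposition import PCA

def mouth_dct_features(mouth_roi, num_coeffs=64):
    """Low-frequency 2D-DCT coefficients of a grayscale mouth-region patch."""
    # Separable 2D DCT: type-II DCT along rows, then along columns
    coeffs = dct(dct(mouth_roi.astype(float), axis=0, norm="ortho"),
                 axis=1, norm="ortho")
    # Keep the top-left (low-frequency) block and flatten it
    k = int(np.ceil(np.sqrt(num_coeffs)))
    return coeffs[:k, :k].flatten()[:num_coeffs]

# Illustrative data: random 64x64 patches stand in for Kinect mouth crops
frames = [np.random.rand(64, 64) for _ in range(200)]
features = np.stack([mouth_dct_features(f) for f in frames])   # shape (200, 64)

# Further dimensionality reduction of the per-frame feature vectors
pca = PCA(n_components=20)
reduced = pca.fit_transform(features)                          # shape (200, 20)
print(features.shape, reduced.shape)
```

The sketch only covers the DCT branch; in the paper, such features are compared with and fused against facial action unit features and 3D facial landmark locations before audiovisual recognition.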