Robust Multimodal Human Machine Interaction using the Kinect Sensor

Conference: Speech Communication - 11. ITG-Fachtagung Sprachkommunikation
24.09.2014 - 26.09.2014 in Erlangen, Germany

Proceedings: Speech Communication

Pages: 4 | Language: English | Type: PDF


Authors:
Zeiler, Steffen; Cwiklak, Jan; Kolossa, Dorothea (Cognitive Signal Processing Group, Ruhr-Universitaet Bochum, 44801 Bochum, Germany)

Abstract:
Distant-talking automatic speech recognition is still not sufficiently robust for everyday use in noisy real-room environments. In the following, we consider audiovisual speech recognition as an alternative approach for the specific toy scenario of controlling computer chess by voice, even while music or newscasts are playing. First, we describe a system design that is suitable for real-time speech input, modeling audiovisual speech in a flexible, simple way that allows for some asynchrony, as is also observed with real-time data from audiovisual sensor systems such as the Kinect. Second, we focus on the selection of the video features and on their dimensionality reduction, comparing and fusing three feature types: a DCT of the mouth region, facial action unit features, and 3D locations of facial landmarks. Finally, the utility of the presented system is tested in highly noisy environments, with speech-to-noise ratios around 0 dB, and the performance of two audiovisual feature sets is compared on this task.
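To make the video feature pipeline described in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation: it computes a 2D DCT of a cropped mouth region, reduces its dimensionality with PCA (the paper's actual reduction method is not specified here), and fuses the result with 3D facial-landmark coordinates by simple concatenation. All function names, array shapes, and parameters (e.g. the 32x32 ROI, 64 DCT coefficients, 30 PCA components, 121 landmarks) are assumptions chosen for the example.

```python
# Hypothetical sketch (not the paper's code): DCT mouth-region features,
# PCA dimensionality reduction, and concatenation with 3D landmarks.
import numpy as np
from scipy.fftpack import dct


def mouth_dct_features(mouth_roi, num_coeffs=64):
    """2D DCT of the mouth ROI; keep the low-frequency block of coefficients."""
    coeffs = dct(dct(mouth_roi.astype(np.float64), axis=0, norm="ortho"),
                 axis=1, norm="ortho")
    k = int(np.sqrt(num_coeffs))
    return coeffs[:k, :k].ravel()


def fit_pca(features, num_components=30):
    """Estimate mean and projection matrix from a (frames x dims) training matrix."""
    mean = features.mean(axis=0)
    centered = features - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:num_components].T


def fuse_frame(mouth_roi, landmarks_3d, pca_mean, pca_proj):
    """Concatenate reduced DCT features with flattened 3D landmark coordinates."""
    dct_feat = mouth_dct_features(mouth_roi)
    dct_red = (dct_feat - pca_mean) @ pca_proj
    return np.concatenate([dct_red, landmarks_3d.ravel()])


# Toy usage with random stand-ins for real Kinect face-tracking output
rng = np.random.default_rng(0)
train_rois = rng.random((200, 32, 32))           # 200 training mouth ROIs
train_dct = np.stack([mouth_dct_features(r) for r in train_rois])
mean, proj = fit_pca(train_dct)

frame_roi = rng.random((32, 32))                 # one video frame's mouth ROI
frame_landmarks = rng.random((121, 3))           # assumed 121 tracked 3D points
av_video_feature = fuse_frame(frame_roi, frame_landmarks, mean, proj)
print(av_video_feature.shape)                    # reduced DCT dims + landmark dims
```

In a full audiovisual recognizer, such per-frame video features would then be combined with acoustic features in a model that tolerates some audio-video asynchrony, as discussed in the abstract.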