Multimodal ASR by Turbo Decoding vs. Feature Concatenation: Where to Perform Information Integration?

Conference: Speech Communication - 11. ITG-Fachtagung Sprachkommunikation
09/24/2014 - 09/26/2014 at Erlangen, Germany

Proceedings: Speech Communication

Pages: 4
Language: English
Type: PDF


Authors:
Receveur, Simon; Weiss, Robin; Fingscheidt, Tim (Institute for Communications Technology, Technische Universitaet Braunschweig, 38106 Braunschweig, Germany)

Abstract:
To achieve robustness against environmental interferences, the incorporation of visual information has been shown to be an effective approach to robust automatic speech recognition (ASR). However, the optimal stage of information integration in multimodal speech processing remains an open question. While results from multiple-input single-output (MISO) mobile communications suggest early integration levels, multimodal ASR may suffer from early integration due to the inherent asynchrony of audio and video features. In this paper we investigate whether early or middle integration strategies perform best in multimodal ASR by comparing feature concatenation and turbo decoding approaches. Applied to an audio-visual speech recognition task on a large database, we show a significant benefit of turbo ASR approaches (middle integration) over early-integration feature vector concatenation, outperforming the latter by about 13% absolute at a signal-to-noise ratio (SNR) of 0 dB.
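To illustrate the early-integration baseline discussed in the abstract, the following is a minimal sketch of feature vector concatenation: audio and video features are fused into a single joint feature stream before decoding. The feature dimensions, frame rates, and the function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def concatenate_features(audio_feats: np.ndarray, video_feats: np.ndarray) -> np.ndarray:
    """Early integration: build one joint feature vector per frame by
    concatenating audio and video features along the feature axis.

    audio_feats: (T_a, D_a) array, e.g. acoustic features at the audio frame rate
    video_feats: (T_v, D_v) array, e.g. lip-region features at the video frame rate
    """
    # Video is typically captured at a lower frame rate than the audio
    # feature stream, so repeat (nearest-neighbour upsample) the video
    # frames to the audio frame rate before concatenation.
    T_a = audio_feats.shape[0]
    idx = np.minimum(
        np.floor(np.arange(T_a) * video_feats.shape[0] / T_a).astype(int),
        video_feats.shape[0] - 1,
    )
    video_upsampled = video_feats[idx]

    # Joint feature vectors of dimension D_a + D_v, fed to a single recognizer.
    return np.hstack([audio_feats, video_upsampled])

# Example with assumed dimensions: 100 audio frames of 39-dim features,
# 25 video frames of 32-dim features -> joint stream of shape (100, 71).
audio = np.random.randn(100, 39)
video = np.random.randn(25, 32)
joint = concatenate_features(audio, video)
```

By contrast, the turbo decoding approach evaluated in the paper (middle integration) keeps separate audio and video decoders that iteratively exchange soft information, which avoids forcing a frame-level fusion of the asynchronous streams.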