Robust Online Multi-Channel Speech Recognition

Conference: Speech Communication - 12. ITG-Fachtagung Sprachkommunikation
5-7 October 2016 in Paderborn, Germany

Proceedings: ITG-Fb. 267: Speech Communication

Pages: 5; Language: English; Type: PDF

Kitza, Markus; Zeyer, Albert; Schlueter, Ralf (Human Language Technology and Pattern Recognition, RWTH Aachen, 52074 Aachen, Germany)
Heymann, Jahn; Haeb-Umbach, Reinhold (Department of Communications Engineering Paderborn, Paderborn University, 33098 Paderborn, Germany)

In this paper, we present a system for robust online far-field multi-channel speech recognition that makes minimal assumptions about the microphone configuration and the target location. We employ an online-enabled Generalized Eigenvalue (GEV) beamformer and a Long Short-Term Memory (LSTM) network to robustly calculate the signal statistics necessary for the beamforming operation in the front-end. After the multiple channels have been condensed into one, a Bidirectional Long Short-Term Memory (BLSTM) acoustic model is applied to a running window of input speech, which enables online decoding in combination with the beamforming front-end. To assess the performance of the system, we test it on the real evaluation set of the CHiME 3 data, where we achieve a Word Error Rate (WER) of 10.4 %.
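
As a rough illustration of the GEV beamforming step named in the abstract, the following is a minimal offline sketch in Python with NumPy. It is not the authors' implementation. It assumes that the signal statistics are mask-weighted spatial covariance matrices, that the speech and noise masks are supplied externally (in the paper an LSTM network provides the statistics), and that all frames are available at once rather than processed online. The function names `masked_psd`, `gev_weights`, and `beamform`, the array layout, and the diagonal loading constant are hypothetical choices made for this example.

```python
import numpy as np


def masked_psd(stft, mask):
    """Mask-weighted spatial covariance matrix per frequency bin.

    stft: complex array of shape (F, T, C) (frequency, time, channel).
    mask: real array of shape (F, T) with values in [0, 1].
    Returns a complex array of shape (F, C, C).
    """
    weighted = stft * mask[..., None]
    # Sum over time of mask * x x^H for every frequency bin.
    psd = np.einsum("ftc,ftd->fcd", weighted, stft.conj())
    norm = np.maximum(mask.sum(axis=1), 1e-10)
    return psd / norm[:, None, None]


def gev_weights(psd_speech, psd_noise, eps=1e-6):
    """Principal generalized eigenvector of (psd_speech, psd_noise) per bin.

    Solves psd_speech w = lambda psd_noise w by whitening with the
    Cholesky factor of the noise matrix. eps is an assumed diagonal
    loading factor that keeps the noise matrix positive definite.
    """
    num_bins, num_ch, _ = psd_speech.shape
    weights = np.empty((num_bins, num_ch), dtype=complex)
    for f in range(num_bins):
        load = eps * np.trace(psd_noise[f]).real / num_ch
        noise = psd_noise[f] + load * np.eye(num_ch)
        chol = np.linalg.cholesky(noise)          # noise = L L^H
        chol_inv = np.linalg.inv(chol)
        whitened = chol_inv @ psd_speech[f] @ chol_inv.conj().T
        whitened = 0.5 * (whitened + whitened.conj().T)  # enforce Hermitian
        _, vecs = np.linalg.eigh(whitened)        # eigenvalues ascending
        # Map the top eigenvector back: w = L^{-H} v.
        weights[f] = chol_inv.conj().T @ vecs[:, -1]
    return weights


def beamform(stft, weights):
    """Apply y(f, t) = w(f)^H x(f, t); returns shape (F, T)."""
    return np.einsum("fc,ftc->ft", weights.conj(), stft)
```

The weight vector for each frequency bin maximizes the ratio of speech to noise power at the beamformer output, which is why no microphone geometry or target direction has to be known. The GEV solution is only defined up to a complex scale factor per bin, so a practical system would typically add a normalization step that this sketch omits.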