Sequence Modeling and Alignment for LVCSR-Systems

Conference: Speech Communication - 13. ITG-Fachtagung Sprachkommunikation
10.10.2018 - 12.10.2018 in Oldenburg, Germany

Proceedings: ITG-Fb. 282: Speech Communication

Pages: 5, Language: English, Type: PDF

Beck, Eugen; Zeyer, Albert; Doetsch, Patrick; Merboldt, Andre; Schlueter, Ralf; Ney, Hermann (Lehrstuhl Informatik 6, RWTH Aachen University, Germany)

Modeling automatic speech recognition (ASR) systems with deep neural networks (DNNs) has led to considerable improvements in performance, with word error rates approximately halved compared to the state of the art 10 to 15 years ago. Current state-of-the-art systems, at least those trained on moderate to medium amounts of training data, still follow the classical separation into language models and generative acoustic models. Acoustic modeling in these systems follows the so-called hybrid HMM approach. In recent years, however, many efforts have been made to derive end-to-end models for ASR, which naturally follow the discriminative structure of neural networks. These include alternative solutions to the alignment problem underlying ASR, which classical systems solve using hidden Markov models (HMMs). In this work we discuss and analyze two novel approaches to DNN-based ASR: the attention-based encoder–decoder approach and the (segmental) inverted HMM approach. Experimental results on the well-known Switchboard corpus are presented and compared against the standard hybrid approach, with a specific focus on the sequence alignment behavior of the different approaches.