Multi-Head Fusion Attention for Transformer-Based End-to-End Automatic Speech Recognition

Conference: Speech Communication - 14th ITG Conference
09/29/2021 - 10/01/2021, online

Proceedings: ITG-Fb. 298: Speech Communication

Pages: 5
Language: English
Type: PDF

Lohrenz, Timo; Schwarz, Patrick; Li, Zhengyang; Fingscheidt, Tim (Institute for Communications Technology, Technische Universität Braunschweig, Braunschweig, Germany)

Stream fusion is a widely used technique in automatic speech recognition (ASR) that exploits additional information to improve recognition performance. While stream fusion is well researched in hybrid ASR, it remains to be further explored for end-to-end ASR. In this work, striving for optimal fusion in end-to-end ASR, we propose a middle fusion method that performs the fusion within the multi-head attention function of the all-attention-based encoder-decoder architecture known as the transformer. Using an exemplary single-microphone setting with fusion of standard magnitude and phase features, we achieve a relative word error rate reduction of 12.1% compared to other authors’ benchmarks on the well-known Wall Street Journal (WSJ) task, and of 9.8% compared to the best recently proposed fusion approach.
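To make the idea of fusing streams inside multi-head attention more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one plausible reading of the abstract: a shared query attends separately to two feature streams (for example, magnitude-based and phase-based representations), and the per-stream attention heads are concatenated and mapped back to the model dimension by a single output projection. The function names, parameter names (`w_q`, `w_k_a`, `w_v_a`, `w_k_b`, `w_v_b`, `w_o`), dimensions, and random weights are all hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def split_heads(x, n_heads):
    # (T, d_model) -> (n_heads, T, d_head)
    t, d_model = x.shape
    return x.reshape(t, n_heads, d_model // n_heads).transpose(1, 0, 2)


def attend(q, k, v):
    # Scaled dot-product attention per head.
    # q: (H, Tq, d_head); k, v: (H, Tk, d_head) -> (H, Tq, d_head)
    d_head = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores, axis=-1) @ v


def fusion_attention(query, stream_a, stream_b, params, n_heads):
    # Assumption: one shared query projection, separate key/value
    # projections per stream, fusion by concatenating all heads.
    q = split_heads(query @ params["w_q"], n_heads)
    heads = []
    for name, stream in (("a", stream_a), ("b", stream_b)):
        k = split_heads(stream @ params["w_k_" + name], n_heads)
        v = split_heads(stream @ params["w_v_" + name], n_heads)
        heads.append(attend(q, k, v))
    fused = np.concatenate(heads, axis=0)           # (2H, Tq, d_head)
    tq = query.shape[0]
    fused = fused.transpose(1, 0, 2).reshape(tq, -1)  # (Tq, 2 * d_model)
    return fused @ params["w_o"]                    # (Tq, d_model)


# Toy usage with hypothetical sizes and random weights.
rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
params = {name: rng.standard_normal((d_model, d_model))
          for name in ("w_q", "w_k_a", "w_v_a", "w_k_b", "w_v_b")}
params["w_o"] = rng.standard_normal((2 * d_model, d_model))

query = rng.standard_normal((4, d_model))     # 4 query positions
stream_a = rng.standard_normal((6, d_model))  # e.g. magnitude stream
stream_b = rng.standard_normal((6, d_model))  # e.g. phase stream
out = fusion_attention(query, stream_a, stream_b, params, n_heads)
```

In this sketch the fusion happens before the output projection, so the layer returns a single sequence with the usual model dimension and could in principle replace a standard attention layer. Where exactly the fusion sits in the proposed method, and how the streams are weighted, is described in the paper itself and may differ from this illustration.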