Multi-Head Fusion Attention for Transformer-Based End-to-End Automatic Speech Recognition

Conference: Speech Communication - 14th ITG Conference
29.09.2021 - 01.10.2021, online

Proceedings: ITG-Fb. 298: Speech Communication

Pages: 5, Language: English, Type: PDF

Lohrenz, Timo; Schwarz, Patrick; Li, Zhengyang; Fingscheidt, Tim (Institute for Communications Technology, Technische Universität Braunschweig, Braunschweig, Germany)

Stream fusion is a widely used technique in automatic speech recognition (ASR) to exploit additional information for better recognition performance. While stream fusion is a well-researched topic in hybrid ASR, it remains to be further explored for end-to-end model-based ASR. In this work, striving to achieve optimal fusion in end-to-end ASR, we propose a middle fusion method that performs the fusion within the multi-head attention function for the all-attention-based encoder-decoder architecture known as the transformer. Using an exemplary single-microphone setting with fusion of standard magnitude and phase features, we achieve a word error rate reduction of 12.1% relative compared to other authors’ benchmarks on the well-known Wall Street Journal (WSJ) task and 9.8% relative compared to the best recently proposed fusion approach.
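
The abstract does not state how the two streams are combined inside the attention function. As one possible reading of "middle fusion", the following minimal sketch lets each attention head attend to both streams separately and then mixes the two resulting context vectors. The function names (`attend`, `split_heads`, `fusion_attention`), the mixing weight `weight_a`, and the weighted-average rule are assumptions made for illustration and are not taken from the paper; the learned projection matrices of a real transformer are omitted to keep the example short.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention for one head; inputs are lists of vectors.
    scale = math.sqrt(len(keys[0]))
    context = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        context.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return context

def split_heads(seq, num_heads):
    # Slice every d_model-sized vector into num_heads equal chunks.
    d_head = len(seq[0]) // num_heads
    return [[vec[h * d_head:(h + 1) * d_head] for vec in seq]
            for h in range(num_heads)]

def fusion_attention(queries, stream_a, stream_b, num_heads, weight_a=0.5):
    # Hypothetical fusion rule (an assumption, not the paper's definition):
    # each head attends to both streams and mixes the two context vectors.
    q_heads = split_heads(queries, num_heads)
    a_heads = split_heads(stream_a, num_heads)
    b_heads = split_heads(stream_b, num_heads)
    fused_heads = []
    for q, a, b in zip(q_heads, a_heads, b_heads):
        ctx_a = attend(q, a, a)  # e.g. context from magnitude features
        ctx_b = attend(q, b, b)  # e.g. context from phase features
        fused_heads.append([[weight_a * x + (1 - weight_a) * y
                             for x, y in zip(row_a, row_b)]
                            for row_a, row_b in zip(ctx_a, ctx_b)])
    # Concatenate the heads back into one d_model-sized vector per query.
    return [sum((head[t] for head in fused_heads), [])
            for t in range(len(queries))]

if __name__ == "__main__":
    # Toy inputs: 2 query positions, two encoded streams, d_model = 4.
    queries = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
    magnitude = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5], [0.5, 0.5, 1.0, 0.0]]
    phase = [[0.2, 0.8, 0.1, 0.9], [0.9, 0.1, 0.8, 0.2]]
    for row in fusion_attention(queries, magnitude, phase, num_heads=2):
        print(row)
```

Because each stream is attended to independently, the two streams may have different sequence lengths in this sketch; only the feature dimension has to match. Setting `weight_a=1.0` reduces it to ordinary single-stream multi-head attention over `stream_a`.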