Analyzing And Improving Neural Speaker Embeddings for ASR

Conference: Speech Communication - 15th ITG Conference
09/20/2023 - 09/22/2023 at Aachen

doi:10.30420/456164040

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Luescher, Christoph; Xu, Jingjing; Zeineldeen, Mohammad; Schlueter, Ralf; Ney, Hermann (Machine Learning and Human Language Technology, RWTH Aachen University, Aachen, Germany & AppTek GmbH, Aachen, Germany)

Abstract:
Neural speaker embeddings encode a speaker’s speech characteristics through a DNN model and are prevalent in speaker verification tasks. However, only a few studies have investigated the use of neural speaker embeddings in ASR systems, and their results are inconclusive. In this work, we present our efforts to integrate neural speaker embeddings into a Conformer-based hybrid HMM ASR system. For ASR, our improved embedding extraction pipeline, combined with the Weighted-Simple-Add integration method, brings x-vectors and c-vectors on par with i-vectors. We further analyze, compare, and combine different speaker embeddings. We improve our already strong baseline by switching to a one-cycle learning schedule, which also reduces the training time. Adding neural speaker embeddings yields further improvements. Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5’00 and Hub5’01 while training only on SWB 300h.
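
The abstract does not specify the shape of the one-cycle learning schedule. As a rough illustration only, the sketch below shows one common piecewise-linear variant: the learning rate rises linearly to a peak, falls back symmetrically, and then anneals to a small final value. The function name, the parameter names, and all numeric values are hypothetical and are not taken from the paper.

```python
def one_cycle_lr(step, total_steps, anneal_steps,
                 initial_lr=1e-4, peak_lr=1e-3, final_lr=1e-6):
    """Piecewise-linear one-cycle learning rate (illustrative sketch).

    All defaults are placeholder assumptions, not the paper's settings.
    Phase 1: linear increase from initial_lr to peak_lr.
    Phase 2: linear decrease from peak_lr back to initial_lr.
    Phase 3: linear anneal from initial_lr down to final_lr.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    cycle_steps = total_steps - anneal_steps  # steps spent in phases 1 and 2
    half = cycle_steps // 2                   # step at which the peak is reached
    if half < 1 or anneal_steps < 1:
        raise ValueError("schedule is too short for three phases")

    if step <= half:
        # Phase 1: warm up towards the peak.
        return initial_lr + (peak_lr - initial_lr) * step / half
    if step <= cycle_steps:
        # Phase 2: come back down to the initial learning rate.
        progress = (step - half) / (cycle_steps - half)
        return peak_lr - (peak_lr - initial_lr) * progress
    # Phase 3: final anneal towards a very small learning rate.
    progress = (step - cycle_steps) / anneal_steps
    return initial_lr - (initial_lr - final_lr) * progress


if __name__ == "__main__":
    # Print the learning rate at the phase boundaries of a 100-step schedule.
    for s in (0, 45, 90, 100):
        print(s, one_cycle_lr(s, total_steps=100, anneal_steps=10))
```

With `total_steps=100` and `anneal_steps=10`, the peak is reached at step 45, the cycle closes at step 90, and the last 10 steps anneal to `final_lr`. Other variants (for example cosine-shaped phases) are equally plausible readings of "one cycle".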