Objective Assessment of a Speech Enhancement Scheme with an Automatic Speech Recognition-Based System

Conference: Speech Communication - 13. ITG-Fachtagung Sprachkommunikation
10.10.2018 - 12.10.2018 in Oldenburg, Germany

Proceedings: ITG-Fb. 282: Speech Communication

Pages: 5; Language: English; Type: PDF

Authors:
Huber, Rainer; Pusch, Arne; Moritz, Niko (Fraunhofer IDMT, Hearing, Speech and Audio Technology and Cluster of Excellence Hearing4All, Oldenburg, Germany)
Rennies, Jan (Fraunhofer IDMT, Hearing, Speech and Audio Technology and Cluster of Excellence Hearing4All, Oldenburg, Germany & Boston University, Department of Speech, Language and Hearing Sciences, Boston, MA, USA)
Schepker, Henning (University of Oldenburg, Department of Medical Physics and Acoustics, Signal Processing Group, and Cluster of Excellence Hearing4All, Oldenburg, Germany)
Meyer, Bernd T. (University of Oldenburg, Department of Medical Physics and Acoustics, Medical Physics Group, and Cluster of Excellence Hearing4All, Oldenburg, Germany)

Abstract:
A single-ended method for predicting perceived listening effort, based on an automatic speech recognition system, was adopted from the literature and modified to evaluate a near-end listening enhancement (NELE) scheme. The listening effort prediction method employs a deep time delay neural network (TDNN) that was trained as part of an automatic speech recognizer. The TDNN computes phoneme posterior probabilities (or “posteriorgrams”), which degrade in the presence of noise or other distortions. The degree of posteriorgram degradation is quantified by a performance measure and serves as a predictor of the mean subjective listening effort ratings of normal-hearing listeners. The modification of the original method consists of using a TDNN (in contrast to the regular feed-forward DNN used before) that was trained on a much larger speech corpus. Without any task-specific training or optimization, the modified method achieves a very high correlation (r = 0.98) with subjective listening effort ratings on the test data set of unprocessed and NELE-processed speech in two types of background noise, generalizes to unseen noise conditions, and produces consistent predictions across these conditions that can be directly compared.
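
The idea of scoring posteriorgram degradation and correlating that score with listening effort ratings can be illustrated with a minimal sketch. The abstract does not name the performance measure, so the mean per-frame entropy used here is only an assumed stand-in, not the authors' measure; the function names `mean_frame_entropy` and `pearson_r` are likewise hypothetical. The intuition is that a clean posteriorgram concentrates probability on one phoneme per frame (low entropy), while a degraded one spreads it across phonemes (high entropy).

```python
import numpy as np


def mean_frame_entropy(posteriorgram):
    """Average per-frame entropy (in nats) of a (frames x phonemes) array.

    Assumed stand-in for a posteriorgram degradation measure:
    higher values mean less certain phoneme posteriors.
    """
    p = np.asarray(posteriorgram, dtype=float)
    # Renormalize each frame so that its posteriors sum to one.
    p = p / p.sum(axis=1, keepdims=True)
    eps = 1e-12  # guards against log(0)
    frame_entropy = -(p * np.log(p + eps)).sum(axis=1)
    return float(frame_entropy.mean())


def pearson_r(x, y):
    """Pearson correlation between predictor values and mean ratings."""
    return float(np.corrcoef(x, y)[0, 1])


if __name__ == "__main__":
    # Toy posteriorgrams (5 frames, 4 phoneme classes), not real data.
    sharp = np.tile([0.97, 0.01, 0.01, 0.01], (5, 1))  # confident frames
    flat = np.full((5, 4), 0.25)                       # maximally uncertain
    print(mean_frame_entropy(sharp), mean_frame_entropy(flat))
```

In an evaluation along the lines described above, one such score would be computed per test condition and passed to `pearson_r` together with the corresponding mean subjective ratings.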