Using Perceptual Evaluation of Speech Quality (PESQ) Loss for DNN-Based Speech Enhancement

Conference: Speech Communication - 15th ITG Conference
20-22 September 2023 in Aachen

doi:10.30420/456164011

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Thieling, Lars; Nippert, Lars; Jax, Peter (Institute of Communication Systems (IKS), RWTH Aachen University, Aachen, Germany)

Abstract:
In deep neural network (DNN)-based speech enhancement approaches, standard regression losses such as the mean squared error (MSE) are often utilized for training. However, these losses typically do not consider human perception and therefore may not lead to good perceptual quality. In this work, we implement a PESQLoss function that approximates the popular perceptual evaluation of speech quality (PESQ) metric. We propose modifications to our existing phase-aware deep speech enhancement approach that enable joint optimization of magnitude and phase estimates using this PESQLoss. By varying the weight of the PESQLoss as an additional term in our total loss, we investigate its influence on the achieved evaluation metrics. Moreover, we present a suppression measure allowing better interpretation of its influence on the estimation results. Our experiments show that the proposed changes for joint optimization lead to an average improvement of about 0.28 MOS w.r.t. PESQ, while achieving similar results for the other metrics (STOI, segmental SNR, DNSMOS).
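
The idea of adding a weighted perceptual term to a total loss can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact form of the total loss, so the MSE base term, the function names, and the weight values are assumptions, and `placeholder_perceptual_loss` is a simple log-spectral distance standing in for the PESQLoss, which is far more involved.

```python
import numpy as np


def mse_loss(estimate, target):
    """Mean squared error between two equally shaped signals."""
    return float(np.mean((estimate - target) ** 2))


def placeholder_perceptual_loss(estimate, target, eps=1e-8):
    """Hypothetical stand-in for a perceptual term (NOT PESQ):
    mean squared distance between log-magnitude spectra."""
    est_mag = np.abs(np.fft.rfft(estimate))
    tgt_mag = np.abs(np.fft.rfft(target))
    return float(np.mean((np.log(est_mag + eps) - np.log(tgt_mag + eps)) ** 2))


def total_loss(estimate, target, perceptual_weight):
    """Assumed additive form: base loss plus a weighted perceptual term."""
    base = mse_loss(estimate, target)
    perceptual = placeholder_perceptual_loss(estimate, target)
    return base + perceptual_weight * perceptual


# Toy signals: a random "clean" signal and a slightly perturbed estimate.
rng = np.random.default_rng(0)
clean = rng.standard_normal(512)
enhanced = clean + 0.1 * rng.standard_normal(512)

# Sweep the weight to see how the perceptual term shifts the total loss.
for weight in (0.0, 0.5, 1.0):
    print(weight, total_loss(enhanced, clean, weight))
```

With a weight of 0 the total reduces to the base loss alone; larger weights let the perceptual term dominate, which is the trade-off the weight sweep in the abstract explores.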