Accurately Capturing Speech Feature Distributions by Extending Supervectors for Robust Speaker Recognition

Conference: Speech Communication - 13. ITG-Fachtagung Sprachkommunikation
10.10.2018 - 12.10.2018 in Oldenburg, Germany

Proceedings: ITG-Fb. 282: Speech Communication

Pages: 5, Language: English, Type: PDF


Wilkinghoff, Kevin (Fraunhofer Institute for Communication, Information Processing and Ergonomics FKIE, Fraunhoferstr. 20, 53343 Wachtberg, Germany)

Supervectors represent speaker-specific Gaussian Mixture Models (GMMs) that are enrolled from a Universal Background Model (UBM) and approximate the unknown underlying speech feature distributions. However, because supervectors consist only of the stacked means of the Gaussian components, the low-dimensional i-vectors derived from them do not completely capture the true feature distributions. In this work, classical supervectors are extended with additional parameters before their dimension is reduced, so that they capture the feature distributions more accurately and complement the i-vectors more effectively. A supervector is extended with the mixture weights, the log-likelihood values of the UBM, a Bhattacharyya-distance-based kernel, and the Hellinger distance between each enrolled Gaussian component and the corresponding component of the UBM. In closed-set speaker identification experiments on the NTIMIT corpus, which consists of telephone-quality speech, the extended supervectors yield significantly lower error rates than standard supervectors, even after fusing them with i-vectors and the UBM.
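The extension described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the diagonal-covariance assumption, the use of exp(-D_B) (the Bhattacharyya coefficient) as the kernel value, and the per-component layout of the UBM log-likelihood values are all assumptions made here for illustration, since the abstract does not specify them.

```python
import numpy as np


def bhattacharyya_distance_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between paired diagonal-covariance Gaussians.

    All inputs have shape (C, D): C components, D feature dimensions.
    Returns an array of shape (C,), one distance per component pair.
    """
    var = 0.5 * (var1 + var2)  # averaged (diagonal) covariance
    mean_term = 0.125 * np.sum((mu1 - mu2) ** 2 / var, axis=1)
    # 0.5 * ln(det(var) / sqrt(det(var1) * det(var2))) for diagonal matrices
    cov_term = 0.5 * np.sum(
        np.log(var) - 0.5 * (np.log(var1) + np.log(var2)), axis=1
    )
    return mean_term + cov_term


def extend_supervector(weights, means, variances,
                       ubm_means, ubm_variances, ubm_loglik):
    """Hypothetical extended supervector (layout is an assumption).

    weights:    (C,)   mixture weights of the enrolled speaker GMM
    means:      (C, D) enrolled component means (the classical supervector)
    variances:  (C, D) enrolled diagonal variances
    ubm_*:      matching UBM parameters
    ubm_loglik: (C,)   assumed per-component UBM log-likelihood values
    """
    d_b = bhattacharyya_distance_diag(means, variances,
                                      ubm_means, ubm_variances)
    kernel = np.exp(-d_b)  # Bhattacharyya coefficient, used as kernel value
    # Hellinger distance follows from the coefficient: H = sqrt(1 - BC)
    hellinger = np.sqrt(np.clip(1.0 - kernel, 0.0, None))
    return np.concatenate(
        [means.ravel(), weights, ubm_loglik, kernel, hellinger]
    )
```

With C components and D feature dimensions, this sketch produces a vector of length C*D + 4*C: the stacked means followed by four per-component blocks. A dimensionality reduction step, as mentioned in the abstract, would then be applied to such vectors.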