Image Transformation based Features for the Visual Discrimination of Prominent and Non-Prominent Words

Conference: Sprachkommunikation - Beiträge zur 10. ITG-Fachtagung (Speech Communication - 10th ITG Symposium)
September 26-28, 2012, Braunschweig, Germany

Proceedings: Sprachkommunikation

Pages: 4
Language: English
Type: PDF

Martin Heckmann (Honda Research Institute Europe GmbH, 63073 Offenbach/Main, Germany)

This paper investigates how visual information extracted from a speaker's mouth region can be used to discriminate prominent from non-prominent words. The analysis relies on a database in which users played a small game with a computer in a Wizard of Oz experiment. Users were instructed to correct recognition errors made by the system, which was expected to render the corrected word highly prominent. Audio-visual recordings were made with a distant microphone and without visual markers. Relative energy and fundamental frequency were calculated as acoustic features. From the visual channel, image transformation based features were extracted from the mouth region. Three image transformations, FFT, DCT and PCA, are compared with a varying number of coefficients. The performance of the visual features is investigated both on their own and in combination with the acoustic features. The comparison is based on classification with a Support Vector Machine (SVM). The results show that all three image transformations yield a performance of approx. 65% in this binary classification task. Furthermore, the information extracted from the visual channel is complementary to the acoustic information: combining both modalities significantly improves performance, up to approx. 80%.
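
To illustrate what an image transformation based feature can look like, the following is a minimal sketch of DCT features computed from a grayscale mouth-region patch. It is not the paper's implementation: the abstract does not state the patch size, the coefficient selection scheme, or any normalization, so the function names (`dct_1d`, `dct_2d`, `low_freq_features`), the choice of an orthonormal DCT-II, and the rule of keeping the top-left k x k block of low-frequency coefficients are all assumptions made for illustration. The patch values are invented.

```python
import math


def dct_1d(signal):
    """Orthonormal DCT-II of a 1-D sequence (direct O(n^2) evaluation)."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(image):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [list(r) for r in zip(*cols)]


def low_freq_features(image, k):
    """Assumed selection rule: keep the top-left k x k coefficient block."""
    coeffs = dct_2d(image)
    return [coeffs[u][v] for u in range(k) for v in range(k)]


# Hypothetical 4 x 4 grayscale patch standing in for a mouth region.
patch = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
features = low_freq_features(patch, 2)  # 4-dimensional feature vector
```

Varying `k` corresponds to varying the number of coefficients. A feature vector like this could be passed to a classifier on its own or concatenated with acoustic features such as relative energy and fundamental frequency; how the two modalities are actually combined in the paper is not specified in the abstract.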