The Impact of Word Alignment Accuracy on Audio-visual Word Prominence Detection

Conference: Speech Communication - 11. ITG-Fachtagung Sprachkommunikation
24.09.2014 - 26.09.2014 in Erlangen, Germany

Proceedings: Speech Communication

Pages: 4
Language: English
Type: PDF

Authors:
Heckmann, Martin (Honda Research Institute Europe GmbH, 63073 Offenbach Main, Germany)
Mikias, Paschalis; Kolossa, Dorothea (Cognitive Signal Processing Group, Ruhr-Universitaet Bochum, 44780 Bochum, Germany)

Abstract:
To automatically detect prominent syllables or words, most approaches require a segmentation of the speech signal followed by an extraction of prosodic features in the resulting segments. In this paper we investigate the impact of the precision of this segmentation on the detection. We segment our prosodically rich audio-visual corpus using an HMM trained on a large dataset, and we investigate different training strategies for this HMM: on the one hand, training without any prior information, i.e. a flat start, and on the other hand, training that uses partially manually created segmentations. Additionally, we introduce features tailored to detect onsets in the spectrogram. We evaluate the segmentation on our corpus in two ways: by comparing it to manual annotations, and functionally, i.e. via its impact on prominent word detection. The results show that using manual annotations in training and adding the onset features significantly improve the segmentation accuracy. Yet the prominent word detection does not benefit from the better segmentation. From this we conclude that the extraction of the prosodic features is robust against segmentation errors.
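
The comparison of an automatic segmentation against manual annotations described in the abstract could be sketched roughly as below. This is an illustration only, not the authors' evaluation procedure: the function name `boundary_agreement`, the 20 ms tolerance, and the toy boundary values are all assumptions, and boundaries are taken to be word boundary times in milliseconds, paired one-to-one.

```python
from statistics import mean


def boundary_agreement(auto_ms, manual_ms, tolerance_ms=20):
    """Compare automatic and manual word boundaries (hypothetical helper).

    Returns the mean absolute deviation in milliseconds and the fraction of
    automatic boundaries lying within `tolerance_ms` of the manual ones.
    The tolerance value is an assumption, not taken from the paper.
    """
    if not auto_ms or len(auto_ms) != len(manual_ms):
        raise ValueError("need two non-empty boundary lists of equal length")

    # Absolute timing error of each automatically placed boundary.
    deviations = [abs(a - m) for a, m in zip(auto_ms, manual_ms)]

    # Share of boundaries that fall inside the tolerance window.
    within = sum(d <= tolerance_ms for d in deviations) / len(deviations)
    return mean(deviations), within


# Toy values for illustration only (not data from the paper).
manual = [0, 310, 640, 1000]
automatic = [10, 300, 700, 1005]
mean_dev, frac_within = boundary_agreement(automatic, manual)
print(mean_dev, frac_within)
```

A functional evaluation, as the abstract calls it, would instead feed both segmentations into the same prominence detector and compare the detection results.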