Toward Semi-supervised Transcription of NAKO+ILSE: Influence of Automatic Speech Recognition Performance on Manual Transcription Effort

Conference: Speech Communication - 15th ITG Conference
20.09.2023-22.09.2023 in Aachen

doi:10.30420/456164020

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Brausse, Elisa; Scheck, Kevin; Schultz, Tanja (Cognitive Systems Lab, University of Bremen, Germany)

Abstract:
Acoustic and linguistic information from spoken communication has been found to be a reliable estimator for the early detection of cognitive decline. Extracting linguistic features, however, requires transcribing the spoken content into text, either manually or by automatic speech recognition (ASR). We propose a semi-automatic transcription system in which ASR output is corrected manually. It is applied to our new project, in which we envision fusing information from the Interdisciplinary Longitudinal Study of Adult Development and Aging (ILSE) with the German National Cohort (NAKO) study by creating a core overlap dataset, NAKO+ILSE. We compare the performance of our ASR system with that of the zero-shot Whisper system on ILSE interview data and on the NAKO+ILSE test data NI0. Because ASR performance differs between the data sets, we analyze the effect of noise levels on ASR performance. Lastly, we analyze the influence of using ASR hypotheses as the basis for manual transcription. With a minimum word error rate (WER) of 45.1 %, the pre-trained Whisper models do not outperform our own ASR system, which achieves 33.55 %. However, manual transcription time is reduced by a factor of three when Whisper large is used as the basis for semi-automatic transcription.
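The results above are reported as word error rate. As a reading aid, the sketch below computes the textbook definition, WER = (S + D + I) / N, where S, D and I are the word substitutions, deletions and insertions in a minimum-edit alignment and N is the number of reference words. This is an illustrative sketch, not the authors' evaluation code. The function name `word_error_rate` is hypothetical, and the text normalization that affects reported WER in practice (casing, punctuation, hesitation markers) is assumed to happen beforehand and is not shown.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER = (S + D + I) / N via word-level Levenshtein distance.

    Hypothetical helper for illustration; assumes both strings are already
    normalized and whitespace-tokenizable.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # d[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match if cost == 0)
            )

    # Total edits divided by the number of reference words N
    return d[len(ref)][len(hyp)] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words -> 0.25
    print(word_error_rate("das ist ein test", "das ist test"))
```

Because the denominator is the reference length, WER can exceed 100 % when a hypothesis contains many insertions.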