Building a German-centric SpeechLLM Using Limited Data

Conference: Speech Communication - 16th ITG Conference
24.09.2025-26.09.2025 in Berlin, Germany

Proceedings: ITG-Fb. 321: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Maurya, Manas; Dethmann, Thomas; Walter, Oliver; Schmidt, Christoph Andreas; Koehler, Joachim

Abstract:
This paper presents a novel approach using German speech data to develop a Speech Large Language Model (SpeechLLM) for processing speech and text inputs. We introduce a data generation process as an alternative to Text-to-Speech for creating a Speech Instruction Following (SIF) training dataset: we prompt an LLM to generate translations and summaries of speech transcripts and pair them with the corresponding audio files. Combining this generated data with the original speech data, we train a model for Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST). Despite being trained only on German speech, our model also processes English speech, retaining multilingual capabilities from its pre-trained components. Evaluation shows reasonable ASR and AST performance given the limited training data and demonstrates zero-shot Spoken Question Answering (SQA) capability, with potential for future enhancements.
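The data generation step described in the abstract can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the `SIFExample` record, the `TASKS` prompt templates, the `build_sif_examples` helper, and the `stub_llm` stand-in are all hypothetical names, and the actual prompts and LLM used in the paper are not given here. The sketch only shows the pairing idea: each original audio file is reused as the speech input, while the text targets come either from the existing transcript (ASR) or from LLM output on that transcript (translation, summary).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class SIFExample:
    audio_path: str   # original recording, reused as-is (no TTS synthesis)
    instruction: str  # text instruction shown to the SpeechLLM
    target: str       # text the model should produce


# Hypothetical templates: (instruction for the SpeechLLM, prompt for the text LLM).
TASKS = {
    "translate": (
        "Translate the speech into English.",
        "Translate the following German text into English:\n{transcript}",
    ),
    "summarize": (
        "Summarize the speech.",
        "Summarize the following German text:\n{transcript}",
    ),
}


def build_sif_examples(
    pairs: Iterable[Tuple[str, str]],
    llm: Callable[[str], str],
) -> List[SIFExample]:
    """Turn (audio_path, transcript) pairs into instruction-following examples."""
    examples: List[SIFExample] = []
    for audio_path, transcript in pairs:
        # Original speech data: the transcript itself is the ASR target.
        examples.append(SIFExample(audio_path, "Transcribe the speech.", transcript))
        # Generated data: the text LLM sees only the transcript, and its
        # output is paired with the corresponding audio file.
        for instruction, prompt_template in TASKS.values():
            target = llm(prompt_template.format(transcript=transcript))
            examples.append(SIFExample(audio_path, instruction, target))
    return examples


def stub_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (assumption: any str -> str callable)."""
    return "[LLM output for] " + prompt.splitlines()[0]


if __name__ == "__main__":
    data = [("clip_0001.wav", "Guten Morgen, wie geht es Ihnen?")]
    for example in build_sif_examples(data, stub_llm):
        print(example)
```

In practice, `stub_llm` would be replaced by a call to whichever text LLM is available. Each transcript yields three examples in this sketch: one ASR example and one per generated task.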