Building a German-centric SpeechLLM Using Limited Data

Conference: Speech Communication - 16th ITG Conference
09/24/2025 - 09/26/2025 at Berlin, Germany

Proceedings: ITG-Fb. 321: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Maurya, Manas; Dethmann, Thomas; Walter, Oliver; Schmidt, Christoph Andreas; Koehler, Joachim

Abstract:
This paper presents a novel approach using German speech data to develop a Speech Large Language Model (Speech-LLM) for processing speech and text inputs. We introduce a data generation process as an alternative to Text-to-Speech for creating a Speech Instruction Following (SIF) training dataset, where we prompt an LLM to generate translations and summaries of speech transcripts and pair them with the corresponding audio files. Combined with the original speech data, we train a model for Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST). Despite training only on German speech, our model processes English speech as well, retaining multilingual capabilities from pre-trained components. Evaluation shows reasonable ASR and AST performance given limited training data and demonstrates zero-shot Spoken Question Answering (SQA) capability with potential for future enhancements.
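
The data generation process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `SIFExample`, `build_sif_examples`, `TASK_PROMPTS`, `SPEECH_INSTRUCTIONS`, and `fake_llm`, as well as the prompt and instruction wording, are hypothetical, and the LLM is represented by an arbitrary callable that maps a text prompt to a text response. The sketch assumes each training item starts as an (audio path, German transcript) pair and shows how LLM outputs generated from the transcript could be paired with the original audio.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class SIFExample:
    """One speech instruction-following training item (hypothetical schema)."""
    audio_path: str   # original recording, reused for every task
    instruction: str  # text instruction given alongside the audio
    target: str       # expected text response


# Hypothetical prompts sent to the text LLM; the paper's wording is not known.
TASK_PROMPTS = {
    "translate": "Translate the following German transcript into English:\n{transcript}",
    "summarize": "Summarize the following German transcript:\n{transcript}",
}

# Hypothetical instructions stored with the audio in the SIF dataset.
SPEECH_INSTRUCTIONS = {
    "translate": "Translate the speech into English.",
    "summarize": "Summarize the speech.",
}


def build_sif_examples(
    pairs: Iterable[Tuple[str, str]],
    llm: Callable[[str], str],
) -> List[SIFExample]:
    """Build ASR items from the original data and LLM-generated items per task."""
    examples: List[SIFExample] = []
    for audio_path, transcript in pairs:
        # Original speech data: the transcript itself is the ASR target.
        examples.append(SIFExample(audio_path, "Transcribe the speech.", transcript))
        for task, template in TASK_PROMPTS.items():
            # The LLM only sees text; its output is paired with the real audio.
            target = llm(template.format(transcript=transcript))
            examples.append(SIFExample(audio_path, SPEECH_INSTRUCTIONS[task], target))
    return examples


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    return "[generated] " + prompt.splitlines()[-1]
```

Calling `build_sif_examples([("a.wav", "Guten Morgen")], fake_llm)` yields three items for the same audio file: one transcription item and one item each for the translation and summary tasks. In practice `fake_llm` would be replaced by a call to an actual LLM, and the generated targets would likely need filtering for quality.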