Stream-ETS: Low-latency End-to-end Speech Synthesis from Electromyography Signals

Conference: Speech Communication - 15th ITG Conference
20.09.2023-22.09.2023 in Aachen

doi:10.30420/456164039

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Scheck, Kevin; Ivucic, Darius; Ren, Zhao; Schultz, Tanja (Cognitive Systems Lab, University of Bremen, Germany)

Abstract:
The electromyographic activity of articulatory muscles provides information about the speech production process. Electromyography (EMG) signals are therefore investigated in the context of Silent Speech Interfaces as a basis for speech communication without acoustic speech. To this end, EMG-to-Speech (ETS) models predict acoustic speech from EMG signals captured during articulation. In this work, we propose Stream-ETS, a streamable end-to-end ETS system. Its architecture consists of a causal EMG encoder, which maps EMG signals to Mel-spectrograms, and a causal neural vocoder, which predicts the acoustic speech signal. Using a GPU, Stream-ETS outputs acoustic speech from 10-millisecond chunks of EMG in approximately 8 milliseconds, so the system runs in real time with low latency. We first pre-train both components and then perform end-to-end fine-tuning. Experiments indicate that end-to-end training increases the naturalness of the synthesized speech.
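
The real-time claim in the abstract can be checked with simple arithmetic: a streaming system keeps up as long as each chunk is processed in no more time than the chunk itself spans. The following Python sketch illustrates this using only the two figures quoted above (10 ms chunks, approx. 8 ms processing time). The class `StreamBudget`, its field names, and the delay estimate (chunk buffering plus compute, ignoring sensor and audio I/O) are illustrative assumptions, not part of the Stream-ETS implementation.

```python
from dataclasses import dataclass


@dataclass
class StreamBudget:
    """Hypothetical helper: timing budget of a chunk-wise streaming system."""

    chunk_ms: float    # duration of EMG signal covered by one chunk
    compute_ms: float  # time needed to synthesize speech for that chunk

    @property
    def real_time_factor(self) -> float:
        # Values below 1.0 mean processing is faster than the signal arrives.
        return self.compute_ms / self.chunk_ms

    @property
    def is_real_time(self) -> bool:
        # The system keeps up if a chunk is done before the next one is complete.
        return self.compute_ms <= self.chunk_ms

    @property
    def worst_case_delay_ms(self) -> float:
        # Assumption: the first sample of a chunk waits for the whole chunk to
        # be buffered and then for the computation; I/O overhead is ignored.
        return self.chunk_ms + self.compute_ms


# Figures taken from the abstract: 10 ms chunks, approx. 8 ms per chunk on a GPU.
budget = StreamBudget(chunk_ms=10.0, compute_ms=8.0)
print(f"real-time factor: {budget.real_time_factor:.2f}")
print(f"keeps up with input: {budget.is_real_time}")
print(f"estimated worst-case delay: {budget.worst_case_delay_ms:.0f} ms")
```

Under these assumptions the real-time factor is 0.8, which leaves roughly 2 ms of headroom per chunk.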