Compression of end-to-end non-autoregressive image-to-speech system for low-resourced devices

Conference: Speech Communication - 15th ITG Conference
20-22 September 2023, Aachen

doi:10.30420/456164029

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Srinivasagan, Gokul (Saarland University, Saarbrücken, Germany & Intel Corporation, Hillsboro, Oregon, USA)
Deisher, Michael (Intel Corporation, Hillsboro, Oregon, USA)
Georges, Munir (Intel Labs, Munich, Germany & Technische Hochschule Ingolstadt, Germany)

Abstract:
People with visual impairments have difficulty accessing touchscreen-enabled personal computing devices such as mobile phones and laptops. Image-to-speech (ITS) systems can help mitigate this problem, but their large model size makes them extremely hard to deploy on low-resourced embedded devices. In this paper, we aim to overcome this challenge by developing an efficient end-to-end neural architecture for generating audio from tiny segments of display content on low-resource devices. We introduce a vision transformer-based image encoder and use knowledge distillation to compress the model from 6.1 million to 2.46 million parameters. Human and automatic evaluation results show that our approach incurs only a minimal drop in performance while speeding up inference by 22%.
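
The abstract names knowledge distillation as the compression technique. As a rough illustration of that general technique only, the following Python (PyTorch) sketch shows a standard distillation objective combining a temperature-scaled KL term against the teacher's soft targets with ordinary cross-entropy against ground-truth labels; the function name, temperature T, and weight alpha are illustrative assumptions, not the authors' published implementation.

    # Minimal knowledge-distillation sketch (PyTorch). The names,
    # temperature T, and weight alpha are illustrative assumptions;
    # the paper does not publish its training code.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets,
                          T=2.0, alpha=0.5):
        # Soft targets: match the temperature-scaled teacher distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # rescale the soft-target gradients (Hinton et al., 2015)
        # Hard targets: cross-entropy against ground-truth labels.
        hard = F.cross_entropy(student_logits, targets)
        return alpha * soft + (1.0 - alpha) * hard

In such a setup, the small student (here, the 2.46-million-parameter model) is trained with this combined objective while a frozen teacher (the 6.1-million-parameter model) supplies the soft targets.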