Uncertainty-Driven Hybrid Fusion for Audio-Visual Phoneme Recognition

Conference: Speech Communication - 15th ITG Conference
20.09.2023-22.09.2023 in Aachen

doi:10.30420/456164050

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Fang, Huajian; Gerkmann, Timo (Signal Processing, Universität Hamburg, Germany)
Frintrop, Simone (Computer Vision, Universität Hamburg, Germany)

Abstract:
For several speech-processing tasks, complementary features from the visual modality may improve model performance. However, unreliable visual input can provide misleading information, degrading performance to the point where it falls below that of audio-only methods. In this work, we propose an uncertainty-driven hybrid fusion scheme for audio-visual phoneme recognition that mitigates the impact of an unreliable visual modality. More specifically, we incorporate modality-wise uncertainty into decision-making, enabling the model to adaptively determine both whether to combine the modalities and to what extent the decision depends on each of them. Experimental results show that the proposed uncertainty-driven hybrid fusion scheme retains the benefits of multi-modal approaches when visual inputs are clean and informative, while remaining robust to distortions of the visual modality.
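The abstract does not spell out implementation details, but the general idea of a hybrid scheme, a hard decision on whether to use the visual stream at all combined with a soft, uncertainty-weighted combination of the modalities, can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration rather than the authors' method: the function name hybrid_fusion, the entropy-based uncertainty measure, and the switch_threshold parameter are hypothetical.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution, computed per frame."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def hybrid_fusion(p_audio, p_video, switch_threshold=0.8):
    """Illustrative uncertainty-driven hybrid fusion of per-frame phoneme posteriors.

    p_audio, p_video: arrays of shape (T, num_phonemes), rows sum to 1.
    switch_threshold: fraction of the maximum entropy above which the visual
        stream is treated as unreliable and ignored (the hard decision).
    Returns fused posteriors of shape (T, num_phonemes).
    """
    num_classes = p_audio.shape[-1]
    max_entropy = np.log(num_classes)

    # Per-frame uncertainty of each modality, normalised to [0, 1].
    u_a = entropy(p_audio) / max_entropy
    u_v = entropy(p_video) / max_entropy

    # Hard decision: use the visual modality only if it is confident enough.
    use_video = u_v < switch_threshold

    # Soft decision: confidence-proportional weights (inverse uncertainty).
    w_a = 1.0 - u_a
    w_v = (1.0 - u_v) * use_video
    norm = w_a + w_v + 1e-12

    fused = (w_a[:, None] * p_audio + w_v[:, None] * p_video) / norm[:, None]
    return fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, K = 5, 40  # frames, phoneme classes
    p_a = rng.dirichlet(np.ones(K) * 0.3, size=T)  # fairly peaked (confident) audio posteriors
    p_v = rng.dirichlet(np.ones(K) * 5.0, size=T)  # diffuse (unreliable) visual posteriors
    fused = hybrid_fusion(p_a, p_v)
    print(fused.shape, np.allclose(fused.sum(axis=1), 1.0))
```

In this sketch, a highly uncertain visual stream is dropped entirely (the hard part of the hybrid scheme), while for reliable visual input both streams contribute in proportion to their per-frame confidence (the soft part); how the paper actually estimates and applies the modality-wise uncertainties is described in the full text.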