Coordinate Mapping Between an Acoustic and Visual Sensor Network in the Shape Domain for a Joint Self-Calibrating Speaker Tracking

Conference: Speech Communication - 11. ITG-Fachtagung Sprachkommunikation
09/24/2014 - 09/26/2014 at Erlangen, Germany

Proceedings: Speech Communication

Pages: 4
Language: English
Type: PDF


Authors:
Jacob, Florian; Haeb-Umbach, Reinhold (Department of Communications Engineering, University of Paderborn, 33098 Paderborn, Germany)

Abstract:
Several self-localization algorithms have been proposed that determine the positions of either acoustic or visual sensors autonomously. Usually these positions are given in a modality-specific coordinate system, with an unknown rotation, translation and scale between the different systems. For joint audiovisual tracking, where the different modalities support each other, the two modalities need to be mapped into a common coordinate system. In this paper we propose to estimate this mapping based on audiovisual correlates, i.e., a speaker who can be localized separately by both a microphone and a camera network. The voice is tracked by a microphone network, which first has to be calibrated by a self-localization algorithm, and the head is tracked by a calibrated camera network. Unlike existing Singular Value Decomposition based approaches to estimating the coordinate system mapping, we propose to perform the estimation in the shape domain, which turns out to be computationally more efficient. The estimation process is embedded into a Random Sample Consensus (RANSAC) framework to obtain a noise-robust mapping. Simulations of the self-localization of an acoustic sensor network and the subsequent coordinate mapping for joint speaker localization showed a significant improvement of the localization performance, since the modalities were able to support each other.
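
The coordinate mapping in question is a similarity transform (rotation, translation and scale) between the acoustic and the visual coordinate system, estimated from speaker positions observed by both networks. The following is a minimal sketch, assuming a standard SVD-based (Umeyama-style) fit embedded in a simple RANSAC loop as described in the abstract; it does not reproduce the paper's shape-domain estimator, and all function names and parameter values are illustrative assumptions.

```python
# Minimal sketch (not the authors' shape-domain method): conventional
# SVD-based estimation of a similarity transform (scale s, rotation R,
# translation t) between corresponding speaker positions from the acoustic
# and the visual coordinate system, wrapped in a simple RANSAC loop.
import numpy as np

def estimate_similarity_svd(A, V):
    """Umeyama-style fit: find s, R, t such that V ~ s * R @ a + t.
    A, V: (N, d) arrays of corresponding points (acoustic / visual)."""
    mu_a, mu_v = A.mean(axis=0), V.mean(axis=0)
    Ac, Vc = A - mu_a, V - mu_v
    U, S, Wt = np.linalg.svd(Vc.T @ Ac)           # SVD of the cross-covariance
    D = np.eye(A.shape[1])
    D[-1, -1] = np.sign(np.linalg.det(U @ Wt))    # enforce a proper rotation
    R = U @ D @ Wt
    s = np.trace(np.diag(S) @ D) / np.sum(Ac ** 2)
    t = mu_v - s * R @ mu_a
    return s, R, t

def ransac_mapping(A, V, n_iter=200, thresh=0.2, min_pts=3):
    """RANSAC: repeatedly fit the mapping on a minimal random subset,
    keep the largest inlier set, and refit on those inliers."""
    rng = np.random.default_rng(0)
    best_inliers = None
    for _ in range(n_iter):
        idx = rng.choice(len(A), size=min_pts, replace=False)
        s, R, t = estimate_similarity_svd(A[idx], V[idx])
        resid = np.linalg.norm(V - (s * (A @ R.T) + t), axis=1)
        inliers = resid < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return estimate_similarity_svd(A[best_inliers], V[best_inliers])
```

Here A would hold speaker positions tracked by the (self-calibrated) microphone network and V the corresponding head positions from the camera network; the returned transform maps the acoustic coordinate system into the visual one so that both modalities can be fused in a common frame.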