Authors: Dorothea Kolossa, Marc Delcroix, Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai
DOI:
Keywords:
Abstract: Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the audio and the video modality may be corrupted in specific spatial regions, for instance due to poor lighting conditions or the presence of background noise. This paper proposes a novel audiovisual fusion framework for localization that assigns individual dynamic stream weights to specific regions of space. This fusion is achieved via a neural network, which combines the predictions of individual audio and video trackers based on their time- and location-dependent reliability. A performance evaluation using audiovisual recordings yields promising results, with the proposed approach outperforming all baseline models.
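The abstract describes a neural network that fuses audio and video tracker predictions using time- and location-dependent stream weights. Below is a minimal illustrative sketch of that idea, assuming each tracker outputs a per-region speaker-presence map and a small MLP predicts one weight per region from reliability features; all names, dimensions, and the network layout are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class DynamicStreamWeightFusion(nn.Module):
    """Sketch: combine audio and video localization maps with per-region,
    per-frame stream weights predicted by a small network (hypothetical)."""

    def __init__(self, num_regions: int, feat_dim: int = 16, hidden: int = 32):
        super().__init__()
        # Small MLP mapping reliability features to one weight per spatial region.
        self.weight_net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_regions),
            nn.Sigmoid(),  # weights constrained to [0, 1]
        )

    def forward(self, audio_map, video_map, reliability_feats):
        # audio_map, video_map: (batch, num_regions) speaker-presence scores
        # reliability_feats:    (batch, feat_dim) time-dependent reliability cues
        w = self.weight_net(reliability_feats)           # (batch, num_regions)
        return w * audio_map + (1.0 - w) * video_map     # convex combination per region


# Usage with random data (illustrative only)
fusion = DynamicStreamWeightFusion(num_regions=8)
audio = torch.rand(2, 8)
video = torch.rand(2, 8)
feats = torch.rand(2, 16)
fused = fusion(audio, video, feats)
print(fused.shape)  # torch.Size([2, 8])
```

The convex combination keeps the fused score in the range of the individual tracker outputs, so a weight near 1 trusts the audio tracker in that region while a weight near 0 trusts the video tracker.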