Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

Authors: Dorothea Kolossa, Marc Delcroix, Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai

DOI:

Keywords:

Abstract: Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the acoustic and the visual modality may be corrupted in specific spatial regions, for instance due to poor lighting conditions or the presence of background noise. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions in the localization space. This fusion is achieved via a neural network, which combines the predictions of individual audio and video trackers based on their time- and location-dependent reliability. A performance evaluation using audiovisual recordings yields promising results, with the fused approach outperforming all baseline models.
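The fusion idea in the abstract can be illustrated with a minimal sketch. It assumes that each tracker outputs a posterior over a discretized localization grid and that the two posteriors are combined log-linearly, with one stream weight per grid cell. The function name `fuse_posteriors`, the five-cell grid, and the hand-set weights are hypothetical. In the paper the weights are predicted by a neural network, which is not reproduced here.

```python
import numpy as np

def fuse_posteriors(p_audio, p_video, weights, eps=1e-12):
    """Log-linear fusion with one stream weight per spatial cell.

    weights[i] = 1 trusts only the audio tracker in cell i,
    weights[i] = 0 trusts only the video tracker.
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    weights = np.clip(np.asarray(weights, dtype=float), 0.0, 1.0)

    # Weighted sum of log-posteriors; eps guards against log(0).
    log_fused = (weights * np.log(p_audio + eps)
                 + (1.0 - weights) * np.log(p_video + eps))

    # Subtract the maximum for numerical stability, then renormalize.
    log_fused -= log_fused.max()
    fused = np.exp(log_fused)
    return fused / fused.sum()

# Hypothetical posteriors over a 5-cell azimuth grid (assumed values).
p_audio = np.array([0.05, 0.10, 0.60, 0.20, 0.05])
p_video = np.array([0.05, 0.05, 0.10, 0.70, 0.10])

# Hand-set, location-dependent weights (assumed values): audio is trusted
# in cell 2, video in cell 3, and both equally elsewhere.
weights = np.array([0.5, 0.5, 0.9, 0.1, 0.5])

fused = fuse_posteriors(p_audio, p_video, weights)
print(fused)
```

With all weights set to 1 the result reduces to the audio posterior, and with all weights set to 0 it reduces to the video posterior. Intermediate values let the reliability of each modality vary from cell to cell.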
