Audio-visual speaker localization via weighted clustering

Authors: Israel D. Gebru, Xavier Alameda-Pineda, Radu Horaud, Florence Forbes

DOI: 10.1109/MLSP.2014.6958874

Abstract: In this paper we address the problem of detecting and locating speakers using audiovisual data. We address this problem in the framework of clustering. We propose a novel weighted clustering method based on a finite mixture model, which explores the idea of non-uniform weighting of observations. Weighted-data clustering techniques have already been proposed, but not in a generative setting as presented here. We introduce a weighted-data mixture model and formally devise the associated EM procedure. The clustering algorithm is applied to the problem of detecting and localizing a speaker over time using both visual and auditory observations gathered with a single camera and two microphones. Audiovisual fusion is enforced by introducing a cross-modal weighting scheme. We test the robustness of the method with experiments in two challenging scenarios: disambiguating between an active and a non-active speaker, and associating a speech signal with a person.
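
The abstract does not spell out the weighted-data mixture model or its EM updates, so the following is only a minimal illustrative sketch and not the authors' implementation. It assumes one possible formulation, in which each observation's weight scales the precision of every Gaussian component (variance divided by the weight), so that high-weight points pull the cluster means more strongly. The function name `weighted_em_1d`, the one-dimensional setting, the fixed weights, and the toy data are all hypothetical; the cross-modal weighting scheme mentioned in the abstract is not modeled here.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def weighted_em_1d(xs, ws, means, variances, priors, n_iter=50):
    """EM for a 1-D Gaussian mixture with fixed per-observation weights.

    Assumption (not taken from the abstract): observation i is modeled by
    component k with variance variances[k] / ws[i].
    """
    means, variances, priors = list(means), list(variances), list(priors)
    n_components = len(means)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x, w in zip(xs, ws):
            p = [priors[k] * normal_pdf(x, means[k], variances[k] / w)
                 for k in range(n_components)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: the weights enter the mean and variance updates.
        for k in range(n_components):
            r_sum = sum(r[k] for r in resp)
            wr_sum = sum(w * r[k] for w, r in zip(ws, resp))
            priors[k] = r_sum / len(xs)
            means[k] = sum(w * r[k] * x for x, w, r in zip(xs, ws, resp)) / wr_sum
            var = sum(w * r[k] * (x - means[k]) ** 2
                      for x, w, r in zip(xs, ws, resp)) / r_sum
            variances[k] = max(var, 1e-6)  # floor to avoid a degenerate component
    return means, variances, priors


# Toy data: two well-separated groups; the second group gets larger weights.
xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
ws = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
means, variances, priors = weighted_em_1d(
    xs, ws, means=[1.0, 4.0], variances=[1.0, 1.0], priors=[0.5, 0.5]
)
print(means, variances, priors)
```

In this sketch the weights are supplied as fixed inputs. In an audio-visual setting they could, for example, be derived from the other modality, but how the paper actually computes them is not stated in the abstract.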
