Audio-visual speaker localization via weighted clustering

Authors: Israel D. Gebru, Xavier Alameda-Pineda, Radu Horaud, Florence Forbes

DOI: 10.1109/MLSP.2014.6958874


Abstract: In this paper we address the problem of detecting and locating speakers using audiovisual data. We address this problem in the framework of clustering. We propose a novel weighted clustering method based on a finite mixture model which explores the idea of non-uniform weighting of observations. Weighted-data clustering techniques have already been proposed, but not in a generative setting as presented here. We introduce the weighted-data mixture model and formally devise the associated EM procedure. The clustering algorithm is applied to the problem of detecting and localizing a speaker over time using both visual and auditory observations gathered with a single camera and two microphones. Audiovisual fusion is enforced by introducing a cross-modal weighting scheme. We test the robustness of the method with experiments in two challenging scenarios: disambiguating between an active and a non-active speaker, and associating a speech signal with a person.
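The idea of EM for a mixture model with per-observation weights can be illustrated with a minimal sketch. The abstract does not give the model equations, so the sketch rests on a labeled assumption: each observation x_i with weight w_i is modeled by a Gaussian component whose covariance is divided by w_i, so a larger weight means a more trusted observation. The function name `weighted_em`, its parameters, and the small regularization constants are hypothetical, and the weights are taken as given inputs; how a cross-modal scheme would compute them is not shown here.

```python
import numpy as np

def weighted_em(X, w, K, n_iter=50, init_means=None, seed=0):
    """EM for a Gaussian mixture with fixed per-observation weights.

    Assumption (not from the abstract): x_i | z_i=k ~ N(mu_k, Sigma_k / w_i).
    X: (n, d) observations, w: (n,) positive weights, K: number of clusters.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if init_means is None:
        means = X[rng.choice(n, size=K, replace=False)].astype(float)
    else:
        means = np.array(init_means, dtype=float)
    # Start every component from the global covariance of the data.
    base_cov = np.cov(X.T).reshape(d, d) + 1e-6 * np.eye(d)
    covs = np.stack([base_cov] * K)
    pis = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities under the weight-scaled Gaussians.
        log_r = np.empty((n, K))
        for k in range(K):
            diff = X - means[k]
            maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(covs[k]), diff)
            _, logdet = np.linalg.slogdet(covs[k])
            log_r[:, k] = np.log(pis[k]) - 0.5 * (
                d * np.log(2 * np.pi) + logdet - d * np.log(w) + w * maha
            )
        log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weights enter the mean and the covariance numerator.
        for k in range(K):
            wr = w * r[:, k]
            means[k] = wr @ X / (wr.sum() + 1e-12)
            diff = X - means[k]
            covs[k] = (diff * wr[:, None]).T @ diff / (r[:, k].sum() + 1e-12)
            covs[k] += 1e-6 * np.eye(d)  # keep the covariance invertible
        pis = np.clip(r.mean(axis=0), 1e-12, None)

    return pis, means, covs, r
```

Under this assumption, setting every weight to 1 reduces the updates to the standard EM for a Gaussian mixture, which is a convenient sanity check.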



