Multistream speaker diarization of meetings recordings beyond MFCC and TDOA features

Authors: Deepu Vijayasenan, Fabio Valente, Hervé Bourlard

DOI: 10.1016/J.SPECOM.2011.07.001

Abstract: Many state-of-the-art diarization systems for meeting recordings are based on the HMM/GMM framework and the combination of spectral (MFCC) and time delay of arrival (TDOA) features. This paper presents an extensive study of how multistream diarization can be improved beyond the combination of these two sets of features. While several other features have been proven effective for speaker diarization, little effort has been devoted to integrating them into the MFCC+TDOA baseline and, to the authors' best knowledge, no positive results have been reported so far. The first contribution of this paper consists in analyzing the reasons for this, investigating through a set of oracle experiments the robustness of HMM/GMM diarization when other features (the modulation spectrum features and the frequency domain linear prediction features) are also integrated. The second contribution consists in introducing a non-parametric multistream diarization method based on the information bottleneck (IB) approach. In contrast to the HMM/GMM framework, which makes use of log-likelihood combination, it combines the feature streams in a normalized space of relevance variables. The previous analysis is repeated, revealing that the proposed approach is more robust and can actually benefit from sources of information beyond the conventional MFCC and TDOA features. Experiments on Rich Transcription data (heterogeneous meetings recorded in different rooms) show that it achieves a very competitive error of only 6.3% when four feature streams are used, compared to 14.9% for the HMM/GMM system. Those results are analyzed in terms of error sensitivity to the stream weightings. To the authors' best knowledge, this is the first successful attempt to reduce the speaker error by combining other features with MFCC and TDOA, and the first study that shows the shortcomings of HMM/GMM in going beyond this baseline. As a last contribution, the paper also addresses issues related to the computational complexity of multistream approaches.
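
The contrast the abstract draws between log-likelihood combination and combination in a normalized space can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the stream weights, the uniform prior used for normalization, and all numeric values are assumptions chosen for illustration. The point it shows is that per-stream log-likelihoods can live on very different scales, whereas per-stream posteriors over a common set of relevance variables all lie in [0, 1], so a weighted average of them remains a valid distribution.

```python
import math


def combine_log_likelihoods(stream_logliks, weights):
    """Weighted sum of per-stream log-likelihoods for one frame and one model.

    The result is unbounded and dominated by whichever stream has the
    largest dynamic range.
    """
    return sum(w * ll for w, ll in zip(weights, stream_logliks))


def posteriors_from_logliks(logliks):
    """Turn per-component log-likelihoods into posteriors.

    Assumes a uniform prior over components (an illustrative assumption).
    Subtracting the maximum keeps the exponentials numerically stable.
    """
    peak = max(logliks)
    exps = [math.exp(ll - peak) for ll in logliks]
    total = sum(exps)
    return [e / total for e in exps]


def combine_posteriors(stream_posteriors, weights):
    """Weighted average of per-stream posterior distributions.

    If the weights sum to 1 and each input sums to 1, the output sums to 1.
    """
    n_components = len(stream_posteriors[0])
    return [
        sum(w * post[k] for w, post in zip(weights, stream_posteriors))
        for k in range(n_components)
    ]


# Hypothetical per-component log-likelihoods for a single frame.
mfcc_logliks = [-42.0, -45.0, -50.0]  # high-dimensional stream, large scale
tdoa_logliks = [-3.0, -2.5, -4.0]     # low-dimensional stream, small scale
weights = [0.7, 0.3]                  # hypothetical stream weights

mfcc_post = posteriors_from_logliks(mfcc_logliks)
tdoa_post = posteriors_from_logliks(tdoa_logliks)
combined = combine_posteriors([mfcc_post, tdoa_post], weights)

# Log-likelihood combination for the first component only.
first_component_score = combine_log_likelihoods(
    [mfcc_logliks[0], tdoa_logliks[0]], weights
)
```

In this sketch `combined` is again a probability distribution regardless of how different the two streams' raw scales are, while `first_component_score` inherits the scale mismatch between the streams.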

References (29)
S. Chen. Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion. Proc. DARPA Broadcast News Transcription and Understanding Workshop (1998).
David A. van Leeuwen, Matej Konečný. Progress in the AMIDA Speaker Diarization System for Meeting Data. Multimodal Technologies for Perception of Humans, pp. 475-483 (2008). DOI: 10.1007/978-3-540-68585-2_44
Gerald Friedland, Oriol Vinyals. Modulation spectrogram features for improved speaker diarization. Conference of the International Speech Communication Association, pp. 630-633 (2008).
Xavier Anguera Miró. Robust Speaker Diarization for Meetings. TDX (Tesis Doctorals en Xarxa) (2006).
Jitendra Ajmera, Iain A McCowan, Hervé Bourlard. Robust audio segmentation. École Polytechnique Fédérale de Lausanne (2004). DOI: 10.5075/EPFL-THESIS-3022
Guillermo Aradilla. Acoustic Models for Posterior Features in Speech Recognition. École Polytechnique Fédérale de Lausanne (2008). DOI: 10.5075/EPFL-THESIS-4164
Hynek Hermansky, Sriram Ganapathy, Samuel Thomas. Front-end for Far-field Speech Recognition based on Frequency Domain Linear Prediction. Conference of the International Speech Communication Association, pp. 984-987 (2008).
Hervé Bourlard, Fabio Valente, Deepu Vijayasenan. KL Realignment for Speaker Diarization with Multiple Feature Streams. Conference of the International Speech Communication Association, pp. 1059-1062 (2009).
Po-Chuan Lin, Jia-Ching Wang, Jhing-Fa Wang, Hao-Ching Sung. Unsupervised speaker change detection using SVM training misclassification rate. IEEE Transactions on Computers, vol. 56, pp. 1212-1244 (2007). DOI: 10.1109/TC.2007.70746
Athanasios Noulas, Ben J. A. Krose. On-line multi-modal speaker diarization. Proceedings of the Ninth International Conference on Multimodal Interfaces (ICMI '07), pp. 350-357 (2007). DOI: 10.1145/1322192.1322254