Speaker diarization and tracking in multiple-sensor environments

Author: Jordi Luque Serrano

Abstract: This thesis presents research on speaker recognition in real conditions such as meeting rooms, telephone-quality speech, and radio and TV broadcast news. The main objective is the automatic detection and classification of speakers in a smart-room scenario. Acoustic speaker recognition is the application of a machine to identify an individual from a spoken sentence. It aims at processing acoustic signals to convert them into symbolic descriptions corresponding to the identity of the speakers. For the last several years, speaker recognition in real situations has been attracting substantial research attention, becoming one of the spoken language technologies that add improvement, or enrichment, to recording transcriptions. In particular, the human activity that takes place in meeting rooms or classrooms, compared with other domains, exhibits increased complexity and poses a challenging problem due to the spontaneity of speech, reverberation effects, the presence of overlapped speech, room setup and channel variability, and a rich assortment of acoustic events, produced either by humans or by the objects they handle. Therefore, determining both the identity of the speakers and their position in time may help to detect and describe that human activity and to provide machine context awareness. We first seek to improve traditional modeling approaches for speaker identification and verification, which are based on Gaussian Mixture Models, through multi-decision and multi-channel strategies in a smart-room scenario. We put emphasis on studying speaker and channel variability techniques such as Maximum a Posteriori Adaptation, Nuisance Attribute Projection, Joint Factor Analysis, and score normalization, with the aim of finding strategies to deal with this drawback. Moreover, we describe a novel speaker verification algorithm that makes use of adapted features from speech recognition. The second line of research is related to speaker detection in a continuous audio stream, where the optimum number of speakers and their identities are unknown a priori. We developed some approaches on top of a previous baseline diarization system based on Hidden Markov Models and Agglomerative Hierarchical Clustering. We evaluate the use of TDOA feature dynamics in order to improve clustering initialization and AHC, as well as the handling of speech overlaps; assess the impact and synergies of Speech Activity Detection and Acoustic Event Detection integrated in the diarization system; and propose and compare new methods such as spectral clustering. The adaptation to the broadcast news domain and to the speaker tracking task is also addressed. Finally, the fusion and combination with video and image modalities is highlighted across this work, in both speaker identification and tracking approaches. Techniques such as Matching Weighting and Particle Filter are proposed to combine scores and likelihoods from different modalities. The results provided demonstrate that these information sources can also play an important role in the person recognition task, adding complementary knowledge to spectrum-based recognition systems and thus improving their accuracy. This work was performed in the framework of several international and national projects, among them the CHIL EU project and the Catalan-funded project Tecnoparla, and through participation in technology evaluations such as CLEAR, NIST Rich Transcription (RT), NIST Speaker Recognition Evaluation (SRE), and the Spanish evaluation Albayzin.
