Finding Difficult Speakers in Automatic Speaker Recognition

Author: Lara Lynn Stoll

DOI:

Keywords: Distance measures; False alarm; Population; Speaker diarisation; Speaker recognition; Formant; Precision and recall; Support vector machine; Speech recognition; Computer science

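Abstract: The task of automatic speaker recognition, wherein a system verifies or determines a speaker's identity using a sample of speech, has been studied for a few decades. In that time, a great deal of progress has been made in improving the accuracy of the system's decisions, through the use of more successful machine learning algorithms and the application of channel compensation techniques and other methodologies aimed at addressing sources of errors such as noise or data mismatch. In general, errors can be expected to have one or more causes, involving both intrinsic and extrinsic factors. Extrinsic factors correspond to external influences, including reverberation, noise, and channel or microphone effects. Intrinsic factors relate inherently to the speaker himself, and include sex, age, dialect, accent, emotion, speaking style, and other voice characteristics. This dissertation focuses on the relatively unexplored issue of the dependence of system errors on intrinsic speaker characteristics. In particular, I investigate the phenomenon that some speakers within a given population have a tendency to cause a large proportion of errors, and explore ways of finding such speakers.
There are two main components to this thesis. In the first, I establish the dependence of system performance on speakers, building upon and expanding previous work demonstrating the existence of speakers with tendencies to cause false alarm or false rejection errors. To this end, I use two different data sets: one is an older collection of conversational telephone speech, and the other is a more recent collection of conversational speech recorded on a variety of channels, including the telephone as well as various types of microphones. Furthermore, in addition to considering a traditional speaker recognition approach, for the second data set I utilize the outputs of a more contemporary approach that is better able to handle variations in channel. The results of this analysis repeatedly show differences in behavior across speakers, for both true speaker and impostor cases. Variation occurs both at the level of utterances, where a speaker's performance can depend on which of his utterances is used, and at the speaker level, where some speakers have an overall tendency to cause errors. Additionally, lamb-ish behavior (where the speaker tends to produce false alarms as the target) is correlated with wolf-ish behavior (where the speaker tends to produce false alarms as the impostor). On one of the data sets, 50% of the errors are caused by only 15-25% of the speakers.
The second component of the thesis investigates a straightforward approach to predicting speakers that will be difficult for a system to correctly recognize. Features are used to calculate feature statistics, which are then used to compute a measure of similarity between speaker pairs. By ranking these measures across speaker pairs, I determine which pairs are easy to distinguish and which are difficult to distinguish. Simple distance measures could successfully select easy- and difficult-to-distinguish speaker pairs, as evaluated by differences in detection cost and false alarm probability across a number of systems. Of those tested, the best feature-measure for finding the most and least difficult-to-distinguish speaker pairs was the Euclidean distance between vectors of the mean first, second, and third formant frequencies. Even greater success was attained with the Kullback-Leibler (KL) divergence between speaker-specific GMMs. An examination of the smallest and biggest distances (as computed by the KL divergence) revealed that individual speakers consistently fall among the most (or least) difficult-to-distinguish speaker pairs.
I then develop an approach for finding speakers who will be difficult for a system, using feature statistics calculated over regions of speech. A support vector machine (SVM) classifier is trained on examples in order to detect difficult target and difficult impostor speakers. The resulting precision and recall were 0.8 and 0.7 for the two detection tasks. Depending on the application, the decision threshold can be tuned to improve precision, recall, or specificity to suit the needs of a particular task. The same approach can be taken for single conversation sides or for sets of conversation sides corresponding to a speaker, since the input can be any set of speech samples.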
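
The pair-ranking idea summarized in the abstract (Euclidean distance between per-speaker vectors of mean first, second, and third formant frequencies) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the function name `rank_speaker_pairs`, the speaker IDs, and the formant values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
import math


def rank_speaker_pairs(mean_formants):
    """Return (distance, speaker_a, speaker_b) tuples, smallest distance first.

    mean_formants maps a speaker ID to a (F1, F2, F3) tuple of mean
    formant frequencies in Hz. This input format is an assumption.
    """
    scored = []
    for a, b in combinations(sorted(mean_formants), 2):
        # Euclidean distance between the two mean-formant vectors.
        dist = math.dist(mean_formants[a], mean_formants[b])
        scored.append((dist, a, b))
    scored.sort()
    return scored


# Made-up illustrative values, not data from the dissertation.
speakers = {
    "spk_a": (500.0, 1500.0, 2500.0),
    "spk_b": (503.0, 1504.0, 2500.0),
    "spk_c": (700.0, 1200.0, 2900.0),
}

ranked = rank_speaker_pairs(speakers)
closest = ranked[0]    # smallest distance: candidate difficult-to-distinguish pair
farthest = ranked[-1]  # largest distance: candidate easy-to-distinguish pair
```

Under this sketch, pairs at the top of the ranking are the candidates for being difficult to distinguish, and pairs at the bottom the candidates for being easy. Whether a small distance actually predicts system errors is the empirical question the thesis evaluates with detection cost and false alarm probability.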

References (61)
Anders Eriksson, Pär Wretling. How flexible is the human voice? - a case study of mimicry. Conference of the International Speech Communication Association (1997)
Hagai Aronowitz, Jacob Goldberger. A distance measure between GMMs based on the unscented transform and its application to speaker recognition. Conference of the International Speech Communication Association, pp. 1985-1988 (2005)
Benoit G. B. Fauve, Jean-François Bonastre, Driss Matrouf, Nicolas Scheffer. A Straightforward and Efficient Implementation of the Factor Analysis Model for Speaker Verification. Conference of the International Speech Communication Association, pp. 1242-1245 (2007)
Pascual Ejarque, Mireia Farrús, Javier Hernando. Jitter and shimmer measurements for speaker recognition. Conference of the International Speech Communication Association, pp. 778-781 (2007)
Volker Dellwo, Mark Huckvale, Michael Ashby. How Is Individuality Expressed in Voice? An Introduction to Speech Production and Description for Speaker Classification. Lecture Notes in Computer Science, pp. 1-20 (2007), 10.1007/978-3-540-74200-5_1
Qin Jin, Alex Waibel. A naïve de-lambing method for speaker identification. Conference of the International Speech Communication Association, pp. 466-469 (2000)
Ben Shahshahani, Remco Teunen, Larry P. Heck. A model-based transformational approach to robust speaker recognition. Conference of the International Speech Communication Association, pp. 495-498 (2000)
Christopher Cieri, Kevin Walker, David Miller. The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text. Language Resources and Evaluation (2004)
Christopher Cieri, Kevin Walker, David Graff, Linda Corson. Resources for new research directions in speaker recognition: the mixer 3, 4 and 5 corpora. Conference of the International Speech Communication Association, pp. 950-953 (2007)
George R. Doddington. Speaker recognition based on idiolectal differences between speakers. Conference of the International Speech Communication Association, pp. 2521-2524 (2001)