Age and Gender Recognition Based on Multiple Systems - Early vs. Late Fusion

Authors: Georg Stemmer, Elmar Nöth, Viktor Zeißler, Tobias Bocklet

Keywords: Speech recognition, Feature vector, Feature (machine learning), Set (psychology), Pattern recognition, Computer science, Utterance, Logistic regression, Fusion, Mixture model, Mel-frequency cepstrum, Artificial intelligence

Abstract: This paper focuses on the automatic recognition of a person's age and gender based only on his or her voice. Up to five different systems are compared and combined in different configurations. Three systems model the speaker's characteristics in different feature spaces, i.e., MFCC, PLP, and TRAPS, by Gaussian mixture models; the features of these systems are the concatenated mean vectors. System number 4 uses a physical two-mass vocal model and estimates 9 glottal features from voiced speech sections in a data-driven optimization procedure. For each utterance, the minimum, maximum, and mean vectors form a 27-dimensional feature vector. The last system calculates a 219-dimensional prosodic feature set for each utterance based on voiced and unvoiced speech segments. We compare two different ways to fuse the systems. First, we concatenate the features on the feature level (early fusion). In the second way, the combination is performed on the score level by multi-class logistic regression (late fusion). Although there are only minor differences between the two approaches, late fusion is slightly superior. On the development set of the Interspeech Agender challenge, we achieved an unweighted recall of 46.1% with early fusion and 47.8% with late fusion.

Index Terms: acoustic analysis, age and gender classification, Gaussian mixture models
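The steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: all function names are hypothetical, the logistic regression is trained here by plain batch gradient descent as a stand-in for whatever training procedure the paper used, and the data in the usage note are synthetic. The sketch shows (1) collapsing per-frame features into minimum, maximum, and mean vectors (9 features give 27 dimensions), (2) early fusion by concatenating feature vectors, (3) late fusion by feeding the systems' class scores into a multi-class logistic regression, and (4) unweighted recall as the mean of the per-class recalls.

```python
import numpy as np

def utterance_stats(frames):
    """Collapse a (n_frames, n_feats) matrix into [min, max, mean] per feature."""
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.min(axis=0), frames.max(axis=0), frames.mean(axis=0)])

def early_fusion(*system_vectors):
    """Feature-level fusion: concatenate one utterance's vectors from all systems."""
    return np.concatenate([np.asarray(v, dtype=float) for v in system_vectors])

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_regression(X, y, n_classes, lr=0.1, epochs=500):
    """Multi-class logistic regression via batch gradient descent (assumed trainer)."""
    X = np.asarray(X, dtype=float)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    W = np.zeros((Xb.shape[1], n_classes))
    Y = np.eye(n_classes)[np.asarray(y)]            # one-hot targets
    for _ in range(epochs):
        P = softmax(Xb @ W)
        W -= lr * Xb.T @ (P - Y) / len(Xb)          # cross-entropy gradient step
    return W

def late_fusion_predict(W, system_scores):
    """Score-level fusion: stack per-system class scores, then apply the trained model."""
    X = np.hstack([np.asarray(s, dtype=float) for s in system_scores])
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return softmax(Xb @ W).argmax(axis=1)

def unweighted_recall(y_true, y_pred, n_classes):
    """Mean of per-class recalls, so each class counts equally regardless of size."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c)
               for c in range(n_classes) if np.any(y_true == c)]
    return float(np.mean(recalls))
```

As a usage illustration on synthetic data: for a hypothetical utterance with 50 voiced frames and 9 glottal features, `utterance_stats` returns a 27-dimensional vector, and `early_fusion` of that vector with a 219-dimensional prosodic vector yields 246 dimensions. For late fusion, the class-score matrices of the individual systems (one row per utterance, one column per class) are stacked side by side, passed to `train_softmax_regression` together with the labels, and the resulting weights are used by `late_fusion_predict`.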
