Authors: Georg Stemmer, Elmar Nöth, Viktor Zeißler, Tobias Bocklet
DOI:
Keywords: Speech recognition, Feature vector, Feature (machine learning), Set (psychology), Pattern recognition, Computer science, Utterance, Logistic regression, Fusion, Mixture model, Mel-frequency cepstrum, Artificial intelligence
Abstract: This paper focuses on the automatic recognition of a person's age and gender based only on his or her voice. Up to five different systems are compared and combined in different configurations: three systems model the speaker's characteristics in different feature spaces, i.e., MFCC, PLP, and TRAPS, by Gaussian mixture models. The features of these systems are the concatenated mean vectors. System number 4 uses a physical two-mass vocal model and estimates 9 glottal features from voiced speech sections in a data-driven optimization procedure. For each utterance, the minimum, maximum, and mean vectors form a 27-dimensional feature vector. The last system calculates a 219-dimensional prosodic feature set for voiced and unvoiced segments. We compare two ways to fuse the systems: first, we concatenate the features on the feature level (early fusion); in the second way, the combination is performed on the score level by multi-class logistic regression (late fusion). Although there are only minor differences between the two approaches, late fusion is slightly superior. On the development set of the Interspeech Agender challenge we achieved an unweighted recall of 46.1% with early fusion and 47.8% with late fusion. Index Terms: acoustic analysis, classification, models
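Two quantities in the abstract can be made concrete with a small sketch: the 27-dimensional utterance vector built from 9 per-frame glottal features, and the unweighted recall used as the evaluation measure. The code below is an illustration under stated assumptions, not the authors' implementation. The function names `utterance_stats` and `unweighted_recall` are hypothetical, the concatenation order (minimum, maximum, mean) is assumed from the order given in the abstract, and unweighted recall is taken to be the mean of the per-class recalls.

```python
from typing import Hashable, List, Sequence


def utterance_stats(frames: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-dimension minimum, maximum, and mean over all frames.

    With 9 features per frame this yields a 27-dimensional vector.
    The min/max/mean ordering is an assumption made for illustration.
    """
    if not frames:
        raise ValueError("an utterance needs at least one frame")
    # Transpose so that each tuple holds one feature dimension over time.
    dims = list(zip(*frames))
    mins = [min(d) for d in dims]
    maxs = [max(d) for d in dims]
    means = [sum(d) / len(d) for d in dims]
    return mins + maxs + means


def unweighted_recall(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recalls; every class counts equally, whatever its size."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    recalls = []
    for cls in set(y_true):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)
```

For example, `utterance_stats` applied to two frames of 9 features each returns a list of length 27. Because `unweighted_recall` averages over classes rather than over utterances, a class with few test samples influences the score as much as a large one, which matters when age and gender classes are unevenly represented.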