Authors: A.G. Adami, R. Mihaescu, D.A. Reynolds, J.J. Godfrey
DOI: 10.1109/ICASSP.2003.1202761
Keywords:
Abstract: Most current state-of-the-art automatic speaker recognition systems extract speaker-dependent features by looking at short-term spectral information. This approach ignores long-term information that can convey supra-segmental information, such as prosodics and speaking style. We propose two approaches that use the fundamental frequency and energy trajectories to capture long-term information. The first approach uses bigram models to model the dynamics of the fundamental frequency and energy trajectories for each speaker. The second approach uses the fundamental frequency trajectories of a predefined set of words as the speaker templates and then, using dynamic time warping, computes the distance between the templates and the words from the test message. The results presented in this work are on Switchboard I using the NIST Extended Data evaluation design. We show that these approaches can achieve an equal error rate of 3.7%, which is a 77% relative improvement over a system based on short-term pitch and energy features alone.
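
The template-matching step in the second approach relies on dynamic time warping (DTW). As a rough illustration only, the sketch below implements textbook DTW between two one-dimensional contours. The abstract does not specify the local cost, path constraints, or length normalization the authors used, so the absolute-difference cost, the unconstrained three-way recursion, the function name `dtw_distance`, and the sample F0 values are all assumptions made for this example.

```python
import math


def dtw_distance(template, test):
    """Textbook DTW distance between two 1-D sequences (illustrative, not the paper's exact setup)."""
    n, m = len(template), len(test)
    # cost[i][j] holds the minimal cumulative cost of aligning template[:i] with test[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: absolute difference between the two frame values.
            local = abs(template[i - 1] - test[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # template advances, test frame is reused
                cost[i][j - 1],      # test advances, template frame is reused
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Hypothetical fundamental frequency contours (Hz) for one word; values are made up.
template_f0 = [110.0, 120.0, 130.0, 125.0]
test_f0 = [110.0, 120.0, 120.0, 130.0, 125.0]  # same shape, one frame stretched
distance = dtw_distance(template_f0, test_f0)
```

Because the test contour here is just a time-stretched copy of the template, the warping path absorbs the extra frame and the distance is zero. A plain frame-by-frame comparison would not be defined for sequences of different lengths, which is why an alignment method of this kind suits word-level contours spoken at varying rates.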