Two extensions to ensemble speaker and speaking environment modeling for robust automatic speech recognition

Authors: Yu Tsao, Chin-Hui Lee

DOI: 10.1109/ASRU.2007.4430087

Keywords: Discriminative model, Gaussian process, Hidden Markov model, Gaussian, Artificial intelligence, Cluster analysis, Speech recognition, Speaker recognition, Pattern recognition, Affine transformation, Computer science, Digit recognition

Abstract: Recently an ensemble speaker and speaking environment modeling (ESSEM) approach to characterizing unknown testing environments was studied for robust speech recognition. Each environment is modeled by a super-vector consisting of the entire set of mean vectors from all Gaussian densities of the HMMs for a particular environment. The super-vector for a new environment is then obtained by an affine transformation on these super-vectors. In this paper, we propose a minimum classification error training procedure to obtain discriminative ensemble elements, and a clustering technique to achieve refined ensemble structures. We test these two extensions to ESSEM on Aurora2. In a per-utterance unsupervised adaptation mode, we achieved an average WER of 4.99% over the 0 dB to 20 dB conditions with these two extensions, compared with 5.51% for the ML-trained gender-dependent baseline. To our knowledge, this represents the best result reported in the literature on the Aurora2 connected digit recognition task.
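
The super-vector construction and the affine transformation mentioned in the abstract can be illustrated with a toy sketch. The abstract does not give the exact form of the transformation or how its parameters are estimated, so the weighted-sum-plus-bias form below, the function names (`build_supervector`, `affine_combine`, `relative_wer_reduction`), and the toy numbers are all assumptions for illustration, not the authors' implementation. Only the two WER figures are taken from the abstract.

```python
from typing import List

Vector = List[float]


def build_supervector(gaussian_means: List[Vector]) -> Vector:
    """Concatenate the mean vectors of all Gaussians into one super-vector."""
    return [x for mean in gaussian_means for x in mean]


def affine_combine(ensemble: List[Vector], weights: Vector, bias: Vector) -> Vector:
    """Return sum_p weights[p] * ensemble[p] + bias.

    Assumed form of an affine transformation on the ensemble super-vectors;
    in practice the weights and bias would be estimated from test data.
    """
    if len(ensemble) != len(weights):
        raise ValueError("need exactly one weight per ensemble super-vector")
    if any(len(v) != len(bias) for v in ensemble):
        raise ValueError("all super-vectors must match the bias dimension")
    out = list(bias)
    for w, v in zip(weights, ensemble):
        for i, x in enumerate(v):
            out[i] += w * x
    return out


def relative_wer_reduction(baseline: float, improved: float) -> float:
    """Relative WER reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Toy ensemble: two environments, each with two 2-dimensional Gaussian means.
env_a = build_supervector([[0.0, 1.0], [2.0, 3.0]])
env_b = build_supervector([[1.0, 1.0], [4.0, 5.0]])

# Hypothetical weights and zero bias; real values would come from adaptation.
new_env = affine_combine([env_a, env_b], weights=[0.5, 0.5], bias=[0.0] * 4)

# WER figures quoted in the abstract: 5.51% baseline, 4.99% with extensions.
reduction = relative_wer_reduction(5.51, 4.99)
print(new_env)
print(f"relative WER reduction: {reduction:.1%}")
```

With the figures quoted in the abstract, the improvement from 5.51% to 4.99% corresponds to a relative WER reduction of roughly 9.4%.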

References (11)
Qiang Huo, Jian Wu, Several HKU approaches for robust speech recognition and their evaluation on Aurora connected digit recognition tasks. Conference of the International Speech Communication Association (2003)
Yu Tsao, Chin-Hui Lee, An ensemble modeling approach to joint characterization of speaker and speaking environments. Conference of the International Speech Communication Association, pp. 1050-1053 (2007)
Yu Tsao, Chin-Hui Lee, A vector space approach to environment modeling for robust speech recognition. Conference of the International Speech Communication Association (2006)
Laurent Mauuary, Holly Kelleher, Douglas Ealey, Yan Ming Cheng, Fabien Saadoun, Duncan Macho, Denis Jouvet, Bernhard Noé, David Pearce, Evaluation of a noise-robust DSR front-end on Aurora databases. Conference of the International Speech Communication Association (2002)
Nicholas Metropolis, S. Ulam, The Monte Carlo method. Journal of the American Statistical Association, vol. 44, pp. 335-341 (1949), 10.1080/01621459.1949.10483310
A. Sankar, Chin-Hui Lee, A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Transactions on Speech and Audio Processing, vol. 4, pp. 190-202 (1996), 10.1109/89.496215
J.-L. Gauvain, Chin-Hui Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 291-298 (1994), 10.1109/89.279278
Dirk Van Compernolle, Noise adaptation in a hidden Markov model speech recognition system. Computer Speech & Language, vol. 3, pp. 151-167 (1989), 10.1016/0885-2308(89)90027-2
Christopher J. Leggetter, Philip C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech & Language, vol. 9, pp. 171-185 (1995), 10.1006/CSLA.1995.0010
Hans-Günter Hirsch, David Pearce, The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. Conference of the International Speech Communication Association, vol. 4, pp. 29-32 (2000)