Improving DNN speaker independence with I-vector inputs

Authors: Andrew Senior, Ignacio Lopez-Moreno

DOI: 10.1109/ICASSP.2014.6853591

Keywords: Speaker recognition, Artificial neural network, Artificial intelligence, Speaker diarisation, Normalization (statistics), Backpropagation, Speech recognition, I vector, Pattern recognition, Computer science

Abstract: We propose providing additional utterance-level features as inputs to a deep neural network (DNN) to facilitate speaker, channel and background normalization. Modifications of the basic algorithm are developed which result in significant reductions in word error rates (WERs). The algorithms are shown to combine well with speaker adaptation by backpropagation, resulting in a 9% relative WER reduction. We address implementation for a streaming task.
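
The idea of feeding utterance-level features to the DNN alongside the acoustic frames can be illustrated with a minimal sketch. It assumes the simplest possible scheme: one fixed vector per utterance (for example an i-vector) is repeated for every frame and concatenated to that frame's features. The function name, the NumPy-based layout, and the sizes (200 frames, 40 acoustic dimensions, a 100-dimensional utterance vector) are illustrative assumptions and are not taken from the paper; the random arrays only stand in for real features.

```python
import numpy as np


def append_utterance_features(frames: np.ndarray, utt_vector: np.ndarray) -> np.ndarray:
    """Append one utterance-level vector to every frame of acoustic features.

    frames:     array of shape (num_frames, frame_dim)
    utt_vector: array of shape (utt_dim,), e.g. an i-vector for the utterance
    returns:    array of shape (num_frames, frame_dim + utt_dim)
    """
    if frames.ndim != 2 or utt_vector.ndim != 1:
        raise ValueError("expected 2-D frames and a 1-D utterance vector")
    # Repeat the same utterance-level vector on every frame (read-only view).
    tiled = np.broadcast_to(utt_vector, (frames.shape[0], utt_vector.shape[0]))
    # Concatenate along the feature axis to form the augmented DNN input.
    return np.concatenate([frames, tiled], axis=1)


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 40))   # stand-in for per-frame acoustic features
ivector = rng.standard_normal(100)        # stand-in for an utterance-level i-vector
dnn_input = append_utterance_features(frames, ivector)
print(dnn_input.shape)
```

Because the appended vector is constant within an utterance, the network sees the same speaker, channel and background summary on every frame. In a streaming setting the vector would have to be estimated from audio seen so far, which this sketch does not attempt.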
