Speaker adaptation of deep neural networks using a hierarchy of output layers

Authors: Ryan Price, Ken-ichi Iso, Koichi Shinoda

DOI: 10.1109/SLT.2014.7078566

Keywords:

Abstract: Deep neural networks (DNN) used for acoustic modeling in speech recognition often have a very large number of output units corresponding to context dependent (CD) triphone HMM states. The amount of data available for speaker adaptation is limited, so the majority of these CD states may not be observed during adaptation. In this case, the posterior probabilities of the unseen states are only pushed towards zero by the adapted DNN, and the ability to predict them can be degraded relative to the speaker independent network. We address this problem by appending an additional output layer which maps the original set of classes to a smaller set of phonetic classes (e.g. monophones), thereby reducing the occurrences of classes unseen in the adaptation data. Adaptation proceeds by backpropagation of errors from the new layer, which is disregarded at recognition time when the original output layer over CD states is used. We demonstrate the benefits of this approach by adapting a network using experiments on a Japanese voice search task and obtain a 5.03% reduction in character error rate with approximately 60 seconds of adaptation data.
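The abstract's idea can be illustrated with a minimal sketch, not the authors' code: a fixed CD-state-to-monophone mapping is appended as an extra output layer, speaker adaptation backpropagates a monophone-level loss through the network, and decoding later uses only the original CD-state outputs. The names `dnn`, `cd_to_mono`, `adapt_speaker`, and the hyperparameters are illustrative assumptions; whether the added layer is a fixed summation (as here) or a trained transformation is one plausible choice.

```python
# Hedged sketch of adaptation through a hierarchical output layer (PyTorch).
# Assumes `dnn` is any acoustic model whose final layer outputs logits over
# `num_cd_states` CD triphone states, and `cd_to_mono` is a LongTensor giving
# the monophone class index of every CD state.
import torch
import torch.nn.functional as F


def build_mapping_matrix(cd_to_mono, num_cd_states, num_monophones):
    """Fixed 0/1 matrix M (CD states x monophones); M[s, m] = 1 iff state s maps to monophone m."""
    M = torch.zeros(num_cd_states, num_monophones)
    M[torch.arange(num_cd_states), cd_to_mono] = 1.0
    return M


def adapt_speaker(dnn, mapping, frames, mono_targets, lr=1e-3, epochs=3):
    """Backpropagate errors from the appended monophone layer through the whole network.

    mapping:      (num_cd_states, num_monophones) constant matrix, never trained.
    frames:       (T, feat_dim) adaptation features for one speaker.
    mono_targets: (T,) monophone labels, i.e. CD targets collapsed via the same mapping.
    """
    opt = torch.optim.SGD(dnn.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        cd_logits = dnn(frames)                    # (T, num_cd_states)
        cd_post = F.softmax(cd_logits, dim=-1)     # CD-state posteriors
        mono_post = cd_post @ mapping              # collapse to monophone posteriors
        loss = F.nll_loss(torch.log(mono_post + 1e-10), mono_targets)
        loss.backward()                            # errors flow from the new layer into the DNN
        opt.step()
    return dnn


# At recognition time the extra layer is simply discarded: decoding uses the
# adapted network's original CD-state outputs, e.g. F.log_softmax(dnn(frames), dim=-1).
```

Because every CD state contributes to some monophone class, the adaptation targets are far less sparse than CD-state targets, which is the effect the abstract attributes to the added layer.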

References (20)
Khe Chai Sim, Bo Li, Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems. Conference of the International Speech Communication Association, pp. 526-529, 2010.
Steve Renals, Mike Hochberg, Luís Nunes, João Paulo Neto, Ciro Martins, Luís B. Almeida, Tony Robinson, Speaker-Adaptation for Hybrid HMM-ANN Continuous Speech Recognition System. Conference of the International Speech Communication Association, pp. 2171-2174, 1995.
Victor Abrash, Michael Cohen, Horacio Franco, Ananth Sankar, Connectionist speaker normalization and adaptation. Conference of the International Speech Communication Association, 1995.
T. Anastasakos, J. McDonough, R. Schwartz, J. Makhoul, A compact model for speaker-adaptive training. International Conference on Spoken Language Processing, vol. 2, pp. 1137-1140, 1996. DOI: 10.1109/ICSLP.1996.607807
Jan Trmal, Jan Zelinka, Ludek Müller, On Speaker Adaptive Training of Artificial Neural Networks. Conference of the International Speech Communication Association, pp. 554-557, 2010.
Andrew Senior, Ignacio Lopez-Moreno, Improving DNN speaker independence with I-vector inputs. International Conference on Acoustics, Speech, and Signal Processing, pp. 225-229, 2014. DOI: 10.1109/ICASSP.2014.6853591
Dong Yu, Kaisheng Yao, Hang Su, Gang Li, Frank Seide, KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. International Conference on Acoustics, Speech, and Signal Processing, pp. 7893-7897, 2013. DOI: 10.1109/ICASSP.2013.6639201
Hank Liao, Speaker adaptation of context dependent deep neural networks. International Conference on Acoustics, Speech, and Signal Processing, pp. 7947-7951, 2013. DOI: 10.1109/ICASSP.2013.6639212
Vishwa Gupta, Patrick Kenny, Pierre Ouellet, Themos Stafylakis, I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription. International Conference on Acoustics, Speech, and Signal Processing, pp. 6334-6338, 2014. DOI: 10.1109/ICASSP.2014.6854823
Tsubasa Ochiai, Shigeki Matsuda, Xugang Lu, Chiori Hori, Shigeru Katagiri, Speaker Adaptive Training Using Deep Neural Networks. International Conference on Acoustics, Speech, and Signal Processing, pp. 6349-6353, 2014. DOI: 10.1109/ICASSP.2014.6854826