Boosted Hybrid DNN/HMM System Based on Correlation-Generated Targets

Authors: Mengzhe Chen, Qingqing Zhang, Jielin Pan, Yonghong Yan

DOI: 10.1109/IIH-MSP.2014.153

Abstract: In current DNN/HMM hybrid systems, the DNN models are trained on 1-of-V targets, which are obtained by Viterbi-based forced alignment. The states are viewed as unrelated and isolated. In fact, some phonemes are acoustically similar. Especially for Chinese, a tonal language, the number of similar pairs is quadrupled. To add similarity information between states into model training, correlation-generated targets are investigated in DNN modeling. For each frame, besides the target state from forced alignment, other states similar to this state will be assigned nonzero values. The similarity degrees are measured by calculating correlation. In this paper, different methods of generating the correlation matrix were investigated, and the details of the implementation with DNNs are described. On the task of Mandarin conversational speech recognition in the customer-service domain, experiments showed that the Boosted Hybrid DNN/HMM System based on correlation-generated targets achieved consistent improvements across different amounts of training data.
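The abstract does not say how the correlation matrix is computed or how its values become training targets, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes each state is summarized by the mean feature vector of its aligned frames, that similarity is the Pearson correlation between those means, that negative correlations are clipped to zero, and that each row is normalized to sum to one. The function names, the `floor` parameter, and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def state_correlation_matrix(features, alignment, num_states):
    """Pearson correlation between per-state mean feature vectors.

    Assumption: every state has at least one aligned frame and a
    non-constant mean vector; otherwise the result contains NaN.
    """
    means = np.stack(
        [features[alignment == s].mean(axis=0) for s in range(num_states)]
    )
    return np.corrcoef(means)


def correlation_targets(alignment, corr, floor=0.0):
    """Replace 1-of-V targets with rows of a normalized similarity matrix.

    Assumption: negative correlations carry no similarity and are set to
    zero; `floor` (hypothetical, expected to be below 1) drops weak ones.
    """
    weights = np.clip(corr, 0.0, None)   # new array, corr is not modified
    weights[weights < floor] = 0.0
    weights /= weights.sum(axis=1, keepdims=True)  # diagonal is 1, so sums are positive
    return weights[alignment]            # one soft target row per frame


# Toy data: 6 frames, 4-dimensional features, 3 states.
# States 0 and 2 have similar (rising) profiles, state 1 a falling one.
features = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 2.0, 3.0, 4.5],
    [3.9, 3.1, 2.0, 1.2],
    [1.0, 2.2, 3.1, 3.9],
])
alignment = np.array([0, 0, 1, 2, 1, 2])  # stand-in for forced-alignment output

corr = state_correlation_matrix(features, alignment, num_states=3)
targets = correlation_targets(alignment, corr)
print(targets.round(3))
```

In this sketch the aligned state still receives the largest value in each row, while acoustically similar states share part of the probability mass. The resulting rows could be used as soft labels in a cross-entropy loss in place of one-hot vectors.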
