Automatic Grammar Augmentation for Robust Voice Command Recognition

Authors: Yang Yang, Anusha Lalitha, Jinwon Lee, Chris Lott

DOI: 10.1109/ICASSP.2019.8682157

Keywords: Noise (video), Pronunciation, Set (abstract data type), Acoustic model, Pipeline (computing), Computer science, Stress (linguistics), Speech recognition, Grammar

Abstract: This paper proposes a novel pipeline for automatic grammar augmentation that provides a significant improvement in voice command recognition accuracy for systems with a small-footprint acoustic model (AM). The improvement is achieved by augmenting the user-defined voice command set, also called the grammar set, with alternate expressions. For a given grammar set, a set of potential alternate expressions (the candidate set) is constructed from an AM-specific statistical pronunciation dictionary that captures the consistent patterns and errors in the AM's decoding induced by variations in pronunciation, pitch, tempo, accent, ambiguous spellings, and noise conditions. Using this candidate set, greedy-optimization-based and cross-entropy-method (CEM) based algorithms are considered to search for an augmented grammar set with improved recognition accuracy, utilizing a command-specific dataset. Our experiments show that the proposed pipeline, along with the algorithms considered, significantly reduces the mis-detection and mis-classification rate without increasing the false-alarm rate. Experiments also demonstrate the superior performance of the CEM method over the greedy-based algorithms.
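The CEM search described in the abstract can be illustrated with a minimal sketch: maintain an inclusion probability per candidate expression, sample grammar sets, score them against a command-specific objective, and shift the probabilities toward the elite samples. The candidate strings and the `toy_score` objective below are hypothetical stand-ins for the paper's AM-derived candidates and accuracy metric, not the authors' actual setup.

```python
import random

def cem_grammar_search(candidates, score, iters=30, pop=50,
                       elite_frac=0.2, alpha=0.7, seed=0):
    """Cross-entropy-method search over subsets of candidate expressions.

    `score` maps a tuple of chosen candidates to a recognition-accuracy
    proxy (higher is better). Returns the best subset found and its score.
    """
    rng = random.Random(seed)
    p = [0.5] * len(candidates)              # per-candidate inclusion probabilities
    n_elite = max(1, int(pop * elite_frac))
    best, best_score = (), float("-inf")
    for _ in range(iters):
        samples = []
        for _ in range(pop):
            mask = [rng.random() < pi for pi in p]
            subset = tuple(c for c, m in zip(candidates, mask) if m)
            samples.append((score(subset), mask, subset))
        samples.sort(key=lambda t: t[0], reverse=True)
        if samples[0][0] > best_score:
            best_score, best = samples[0][0], samples[0][2]
        elite = samples[:n_elite]
        for i in range(len(p)):              # smoothed update toward elite frequencies
            freq = sum(mask[i] for _, mask, _ in elite) / n_elite
            p[i] = alpha * freq + (1 - alpha) * p[i]
    return best, best_score

# Hypothetical objective: reward candidates that recover mis-decoded
# utterances of the command "plays", penalize grammar-set size.
def toy_score(subset):
    gains = {"pleighs": 2.0, "play's": 1.5, "plays": 0.2, "prays": -1.0}
    return sum(gains.get(c, 0.0) for c in subset) - 0.1 * len(subset)

best, val = cem_grammar_search(["pleighs", "play's", "plays", "prays"], toy_score)
```

A greedy baseline would instead add one candidate at a time, keeping each addition only if the score improves; CEM explores whole subsets jointly, which is what the paper credits for its superior performance.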
