Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds

Authors: Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Atsunori Ogawa

DOI: 10.1016/J.CSL.2012.07.006

Keywords: Voice activity detection, Speech processing, Acoustic model, Linear predictive coding, Compensation (engineering), Computer science, Speech recognition, Speech enhancement, Noise, Channel (digital image)

Abstract: Research on noise robust speech recognition has mainly focused on dealing with relatively stationary noise that may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise, that is, spatial (directional), spectral and temporal information. This is realized with a model-based speech enhancement pre-processor, which consists of two complementary elements: a multi-channel speech-noise separation method that exploits spatial and spectral information, followed by a single channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with the dynamic adaptive compensation of the variance of the Gaussians of the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy.
