SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures

Authors: Katerina Zmolikova, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Tomohiro Nakatani

DOI: 10.1109/JSTSP.2019.2922820

Keywords: Speech recognition, Artificial neural network, Speech processing, Computer science, Adaptation (computer science), Hidden Markov model, Representation (mathematics), Noise (video), Artificial intelligence, Deep learning, Utterance

Abstract: The processing of speech corrupted by interfering overlapping speakers is one of the challenging problems with regards to today's automatic speech recognition systems. Recently, approaches based on deep learning have made great progress toward solving this problem. Most of these approaches tackle the problem as speech separation, i.e., they blindly recover all the speakers from the mixture. In some scenarios, such as smart personal devices, we may however be interested in recovering only one target speaker from a mixture. In this paper, we introduce SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker. Formulating the problem as the extraction of a single target speaker avoids certain issues such as label permutation and the need to determine the number of speakers in the mixture. With SpeakerBeam, we jointly learn to extract a representation characterizing the target speaker from the adaptation utterance and to use this representation for extracting the speaker. We explore several ways to do this, mostly inspired by speaker adaptation methods for acoustic models of automatic speech recognition. We evaluate the performance of the proposed method on the widely used WSJ0-2mix and WSJ0-3mix datasets, and on these datasets modified with more noise or more realistic overlapping patterns. We further analyze the learned behavior by exploring the speaker representations and assessing the effect of the length of the adaptation data. The results show the benefit of including speaker information in the processing and the effectiveness of the proposed method.
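To make the core idea concrete, below is a minimal sketch in Python/PyTorch of the scheme the abstract describes: an auxiliary network maps the adaptation utterance to a speaker representation, which then conditions a mask-estimation network that extracts the target speaker from the mixture. This is not the authors' implementation; the multiplicative scaling of a hidden layer is one of several adaptation variants the paper explores, and all names, layer sizes, and the mean-pooled embedding are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SpeakerBeamSketch(nn.Module):
        def __init__(self, n_bins=257, hidden=256):
            super().__init__()
            # Auxiliary net: adaptation-utterance features -> speaker embedding.
            self.aux = nn.Sequential(nn.Linear(n_bins, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
            # Main net: mixture features -> time-frequency mask for the target.
            self.pre = nn.LSTM(n_bins, hidden, batch_first=True)
            self.post = nn.LSTM(hidden, hidden, batch_first=True)
            self.mask = nn.Linear(hidden, n_bins)

        def forward(self, mixture, adaptation):
            # mixture: (B, T, n_bins); adaptation: (B, T_a, n_bins)
            emb = self.aux(adaptation).mean(dim=1)   # (B, hidden), pooled over time
            h, _ = self.pre(mixture)                 # (B, T, hidden)
            h = h * emb.unsqueeze(1)                 # multiplicative speaker adaptation
            h, _ = self.post(h)
            return torch.sigmoid(self.mask(h))       # mask values in [0, 1]

    if __name__ == "__main__":
        model = SpeakerBeamSketch()
        mix = torch.randn(2, 100, 257)    # two mixtures, 100 frames each
        adapt = torch.randn(2, 50, 257)   # matching adaptation utterances
        print(model(mix, adapt).shape)    # torch.Size([2, 100, 257])

Training both networks jointly, as the abstract states, lets the speaker embedding specialize for the extraction task rather than being fixed in advance (e.g., a pre-trained i-vector).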
