Authors: Katerina Zmolikova, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Tomohiro Nakatani
DOI: 10.1109/JSTSP.2019.2922820
Keywords: Speech recognition, Artificial neural network, Speech processing, Computer science, Adaptation (computer science), Hidden Markov model, Representation (mathematics), Noise (video), Artificial intelligence, Deep learning, Utterance
Abstract: The processing of speech corrupted by interfering overlapping speakers is one of the challenging problems with regard to today's automatic speech recognition systems. Recently, approaches based on deep learning have made great progress toward solving this problem. Most of these approaches tackle the problem as speech separation, i.e., they blindly recover all the speakers from the mixture. In some scenarios, such as smart personal devices, we may however be interested in recovering only one target speaker from a mixture. In this paper, we introduce SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker. Formulating the problem as the extraction of only the target speaker avoids certain issues such as label permutation and the need to determine the number of speakers in the mixture. With SpeakerBeam, we jointly learn to extract a representation characterizing the target speaker from the adaptation utterance and to use this representation to extract the target speaker from the mixture. We explore several ways to do this, mostly inspired by speaker adaptation methods for acoustic models in automatic speech recognition. We evaluate the performance on the widely used WSJ0-2mix and WSJ0-3mix datasets, and on these datasets modified with more noise or more realistic overlapping patterns. We further analyze the learned behavior by exploring the speaker representations and by assessing the effect of the length of the adaptation data. The results show the benefit of including speaker information in the processing and the effectiveness of the proposed method.
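To make the joint formulation concrete, the sketch below illustrates the general idea: an auxiliary network derives a speaker embedding from the adaptation utterance, and that embedding multiplicatively scales a hidden layer of the extraction network, which estimates a time-frequency mask for the target speaker only. This is a minimal sketch assuming PyTorch; all module names, dimensions, and the exact adaptation layer are simplified assumptions, not the paper's precise architecture (the paper explores several adaptation schemes).

```python
# Minimal sketch of speaker-conditioned target speaker extraction.
# All names (AuxiliaryNet, ExtractionNet, feature sizes) are hypothetical.
import torch
import torch.nn as nn

class AuxiliaryNet(nn.Module):
    """Derives a speaker embedding from the adaptation utterance."""
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )

    def forward(self, adapt_feats: torch.Tensor) -> torch.Tensor:
        # adapt_feats: (batch, time, feat_dim); averaging over time makes the
        # embedding independent of the adaptation utterance length.
        return self.net(adapt_feats).mean(dim=1)  # (batch, hidden_dim)

class ExtractionNet(nn.Module):
    """Estimates a time-frequency mask for the target speaker only."""
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.post = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim), nn.Sigmoid(),  # mask in [0, 1]
        )

    def forward(self, mix_feats: torch.Tensor,
                spk_emb: torch.Tensor) -> torch.Tensor:
        h = self.pre(mix_feats)          # (batch, time, hidden_dim)
        # Speaker-adaptive layer: element-wise scaling of hidden activations
        # by the speaker embedding (one multiplicative adaptation variant).
        h = h * spk_emb.unsqueeze(1)     # broadcast over time frames
        return self.post(h)              # (batch, time, feat_dim)

# Joint training: both networks are optimized together, so the embedding is
# learned to be useful for extraction rather than fixed in advance.
aux, ext = AuxiliaryNet(257, 512), ExtractionNet(257, 512)
mix = torch.rand(8, 100, 257)     # magnitude spectra of the mixture (toy data)
adapt = torch.rand(8, 300, 257)   # adaptation utterance of the target speaker
target = torch.rand(8, 100, 257)  # clean target spectra (toy data)

mask = ext(mix, aux(adapt))
loss = nn.functional.mse_loss(mask * mix, target)  # reconstruction objective
loss.backward()                   # gradients flow into both networks jointly
```

Because the mask is estimated only for the target speaker, this formulation needs no permutation-invariant label assignment and no knowledge of how many speakers the mixture contains, which is the advantage the abstract highlights.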