Authors: Herman Kamper, Micha Elsner, Aren Jansen, Sharon Goldwater
DOI: 10.1109/ICASSP.2015.7179087
Keywords:
Abstract: Deep neural networks (DNNs) have become a standard component in supervised ASR, used in both data-driven feature extraction and acoustic modelling. Supervision is typically obtained from a forced alignment that provides phone class targets, requiring transcriptions and pronunciations. We propose a novel unsupervised DNN-based feature extractor that can be trained without these resources in zero-resource settings. Using unsupervised term discovery, we find pairs of isolated word examples of the same unknown type; these provide weak top-down supervision. For each pair, dynamic programming is used to align the feature frames of the two words. Matching frames are presented as input-output pairs to a deep autoencoder (AE) neural network. Using this AE as a feature extractor in a word discrimination task, we achieve a 64% relative improvement over a previous state-of-the-art system and a 57% improvement relative to a bottom-up trained deep AE, and come to within 23% of a supervised system.
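The frame-alignment step described in the abstract can be illustrated with a minimal sketch. The abstract only says "dynamic programming", so the use of classic dynamic time warping (DTW), the Euclidean frame distance, and the function names `dtw_align` and `training_pairs` are all assumptions made for illustration, not the authors' implementation. Each word is represented as a list of feature frames (lists of floats), and each aligned frame pair becomes one input-output example for the autoencoder.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two frames (assumed metric, not from the paper)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(word_a, word_b, dist=euclidean):
    """Return the DTW alignment path as a list of (i, j) frame index pairs."""
    n, m = len(word_a), len(word_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning the first i frames
    # of word_a with the first j frames of word_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(word_a[i - 1], word_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance in both words
                cost[i - 1][j],      # advance in word_a only
                cost[i][j - 1],      # advance in word_b only
            )
    # Backtrack from the end of both words to recover the path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def training_pairs(word_a, word_b):
    """Turn one discovered word pair into (input frame, target frame) examples."""
    return [(word_a[i], word_b[j]) for i, j in dtw_align(word_a, word_b)]
```

For example, `training_pairs([[0.0], [1.0], [2.0]], [[0.0], [0.0], [1.0], [2.0]])` yields four frame pairs, one per step of the alignment path. In a real setting the frames would be acoustic feature vectors and the resulting pairs would be fed to the autoencoder as input and reconstruction target.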