Unsupervised neural network based feature extraction using weak top-down constraints

Authors: Herman Kamper, Micha Elsner, Aren Jansen, Sharon Goldwater

DOI: 10.1109/ICASSP.2015.7179087

Abstract: Deep neural networks (DNNs) have become a standard component in supervised ASR, used in both data-driven feature extraction and acoustic modelling. Supervision is typically obtained from a forced alignment that provides phone class targets, requiring transcriptions and pronunciations. We propose a novel unsupervised DNN-based feature extractor that can be trained without these resources in zero-resource settings. Using unsupervised term discovery, we find pairs of isolated word examples of the same unknown type; these provide weak top-down supervision. For each pair, we use dynamic programming to align the feature frames of the two words. Matching frames are presented as input-output pairs to a deep autoencoder (AE) neural network. Using this AE as a feature extractor in a word discrimination task, we achieve a 64% relative improvement over a previous state-of-the-art system and a 57% improvement relative to a bottom-up trained deep AE, and come to within 23% of a supervised system.
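The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes each word is a list of feature frames (plain lists of floats) and uses standard dynamic time warping with a Euclidean frame distance; the distance measure and the function names `frame_distance`, `dtw_align`, and `training_pairs` are illustrative assumptions, not details taken from the paper. The resulting frame pairs are what would be presented to the autoencoder as input and target.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(word_a, word_b):
    """Align two frame sequences with dynamic time warping.

    Returns the warping path as a list of (i, j) index pairs, where frame i
    of word_a is matched to frame j of word_b.
    """
    n, m = len(word_a), len(word_b)
    inf = float("inf")
    # cost[i][j]: cheapest cumulative cost of aligning word_a[:i] with word_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(word_a[i - 1], word_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the final cell to recover the matched frame indices.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # advance in word_a only
            (cost[i][j - 1], i, j - 1),          # advance in word_b only
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def training_pairs(word_a, word_b):
    """Turn a DTW path into (input_frame, target_frame) pairs for an autoencoder."""
    return [(word_a[i], word_b[j]) for i, j in dtw_align(word_a, word_b)]


# Toy example: two "words" with one-dimensional frames and different lengths.
word_a = [[0.0], [1.0], [2.0]]
word_b = [[0.0], [0.0], [1.0], [2.0]]
for source, target in training_pairs(word_a, word_b):
    print(source, "->", target)
```

In this toy case the first frame of `word_a` is matched to the first two frames of `word_b`, so the shorter word contributes a repeated input frame. A real system would use multi-dimensional acoustic features and train the network on pairs pooled over many discovered word pairs.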

References (25)
Hynek Hermansky, Aren Jansen, Kenneth Church. Towards Spoken Term Discovery At Scale With Zero Resources. Conference of the International Speech Communication Association, pp. 1676-1679 (2010).
Man-Hung Siu, Herbert Gish, Arthur Chan, William Belfield. Unsupervised training of an HMM-based speech recognizer for topic classification. Conference of the International Speech Communication Association, pp. 1935-1938 (2009).
Ian J Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent Dumoulin, Mehdi Mirza, Razvan Pascanu, James Bergstra, Frédéric Bastien, Yoshua Bengio. Pylearn2: a machine learning research library. arXiv: Machine Learning (2013).
Aren Jansen, Samuel Thomas, Hynek Hermansky. Weak top-down constraints for unsupervised acoustic model training. International Conference on Acoustics, Speech, and Signal Processing, pp. 8091-8095 (2013). DOI: 10.1109/ICASSP.2013.6639241
M.J. Hunt, S.M. Richardson, D.C. Bateman, A. Piau. An investigation of PLP and IMELDA acoustic representations and of their potential for combination. International Conference on Acoustics, Speech, and Signal Processing, pp. 881-884 (1991). DOI: 10.1109/ICASSP.1991.150480
Leonardo Badino, Claudia Canevari, Luciano Fadiga, Giorgio Metta. An auto-encoder based approach to unsupervised learning of subword units. International Conference on Acoustics, Speech, and Signal Processing, pp. 7634-7638 (2014). DOI: 10.1109/ICASSP.2014.6855085
Oliver Walter, Timo Korthals, Reinhold Haeb-Umbach, Bhiksha Raj. A hierarchical system for word discovery exploiting DTW-based initialization. IEEE Automatic Speech Recognition and Understanding Workshop, pp. 386-391 (2013). DOI: 10.1109/ASRU.2013.6707761
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp. 1096-1103 (2008). DOI: 10.1145/1390156.1390294
M.D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, G.E. Hinton. On rectified linear units for speech processing. International Conference on Acoustics, Speech, and Signal Processing, pp. 3517-3521 (2013). DOI: 10.1109/ICASSP.2013.6638312
Gabriel Synnaeve, Thomas Schatz, Emmanuel Dupoux. Phonetics embedding learning with side information. Spoken Language Technology Workshop, pp. 106-111 (2014). DOI: 10.1109/SLT.2014.7078558