One-to-Many Voice Conversion Based on Tensor Representation of Speaker Space

Authors: Nobuaki Minematsu, Keikichi Hirose, Daisuke Saito, Keisuke Yamamoto

Abstract: This paper describes a novel approach to flexible control of speaker characteristics using a tensor representation of speaker space. In voice conversion studies, realizing conversion from/to an arbitrary speaker’s voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice Gaussian mixture model (EV-GMM) was proposed. In EVC, similarly to speaker recognition approaches, a speaker space is constructed from GMM supervectors, which are high-dimensional vectors derived by concatenating the mean vectors of each of the speaker GMMs. In this speaker space, each speaker is represented by a small number of weight parameters of eigen-supervectors. In this paper, we revisit the construction of the speaker space by introducing tensor analysis of the training data set. In our approach, each speaker is represented as a matrix whose rows and columns respectively correspond to the Gaussian component and the dimension of the mean vector, and the speaker space is derived by tensor analysis of the set of these matrices. Our approach can solve an inherent problem of the supervector representation, and it improves the performance of voice conversion. Experimental results of one-to-many voice conversion demonstrate the effectiveness of the proposed approach.

Index Terms: voice conversion, Gaussian mixture model, eigenvoice, tensor analysis, Tucker decomposition
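
The contrast between the supervector view and the matrix (tensor) view can be illustrated with a small NumPy sketch. It is not the authors' procedure: it uses a truncated higher-order SVD as one simple way to obtain a Tucker decomposition, and the sizes (`S` speakers, `M` Gaussian components, `D` mean dimensions), the random data, and all function names are assumptions made only for illustration.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)

def truncated_hosvd(tensor, ranks):
    """Truncated higher-order SVD, one simple route to a Tucker decomposition."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])          # leading left singular vectors
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project onto each factor basis
    return core, factors

def reconstruct(core, factors):
    """Rebuild the tensor from the core and the factor matrices."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out

# Hypothetical toy data: S speakers, each with M Gaussian means of dimension D.
rng = np.random.default_rng(0)
S, M, D = 8, 4, 3
means = rng.standard_normal((S, M, D))

# Supervector view: every speaker becomes one long vector of length M * D.
supervectors = means.reshape(S, M * D)

# Tensor view: keep each speaker as an M x D matrix and stack them.
bias = means.mean(axis=0)                    # average speaker
centered = means - bias

# Full ranks reproduce the data; smaller ranks give a compact speaker space.
core_full, factors_full = truncated_hosvd(centered, ranks=(S, M, D))
core_small, factors_small = truncated_hosvd(centered, ranks=(3, M, D))

# Row s of the speaker-mode factor acts as the weight vector of speaker s.
speaker_weights = factors_small[0]
approx_small = reconstruct(core_small, factors_small) + bias
```

In this sketch the supervector view flattens away the distinction between Gaussian component and feature dimension, while the tensor view keeps them as separate modes, so each mode gets its own basis.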

References (21)
Aki Kunikoshi, Nobuaki Minematsu, Keikichi Hirose, Yu Qiao. Speech Generation from Hand Gestures Based on Space Mapping. Conference of the International Speech Communication Association, pp. 308-311 (2009).
Chung-Hsien Wu, Chung-Han Lee. Map-based adaptation for speech conversion using adaptation data selection and non-parallel training. Conference of the International Speech Communication Association (2006).
Tomoki Toda, Yamato Ohtani, Kiyohiro Shikano. Eigenvoice Conversion Based on Gaussian Mixture Model. Conference of the International Speech Communication Association (2006).
Hiroshi Saruwatari, Tomoki Toda, Yamato Ohtani, Kiyohiro Shikano. Speaker Adaptive Training for One-to-Many Eigenvoice Conversion Based on Gaussian Mixture Model. Conference of the International Speech Communication Association, pp. 1981-1984 (2007).
Li Deng, A. Acero, Li Jiang, J. Droppo, Xuedong Huang. High-performance robust speech recognition using stereo training data. International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 301-304 (2001). DOI: 10.1109/ICASSP.2001.940827
Ledyard R Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, vol. 31, pp. 279-311 (1966). DOI: 10.1007/BF02289464
Akira Kurematsu, Kazuya Takeda, Yoshinori Sagisaka, Shigeru Katagiri, Hisao Kuwabara, Kiyohiro Shikano. ATR Japanese speech database as a tool of speech recognition and synthesis. Speech Communication, vol. 9, pp. 357-363 (1990). DOI: 10.1016/0167-6393(90)90011-W
Tomoki Toda, Yamato Ohtani, Kiyohiro Shikano. One-to-Many and Many-to-One Voice Conversion Based on Eigenvoices. International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 1249-1252 (2007). DOI: 10.1109/ICASSP.2007.367303
Yongwon Jeong. Speaker adaptation based on the multilinear decomposition of training speaker models. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4870-4873 (2010). DOI: 10.1109/ICASSP.2010.5495117
Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, Kiyohiro Shikano. Non-parallel training for many-to-many eigenvoice conversion. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4822-4825 (2010). DOI: 10.1109/ICASSP.2010.5495139