DOI: 10.1016/J.CSL.2009.03.003
Keywords:
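Abstract: The basic goal of a voice conversion system is to modify the speaker-specific characteristics while keeping the message and the environmental information contained in the speech signal intact. Speaker characteristics are reflected in speech at different levels, such as the shape of the glottal pulse (excitation source characteristics), the shape of the vocal tract (vocal tract system characteristics), and the long-term features (suprasegmental or prosodic characteristics). In this paper, we propose neural network models for developing mapping functions at each level. The features used for developing the mapping functions are extracted using pitch synchronous analysis. Pitch synchronous analysis provides accurate estimation of the vocal tract parameters by analyzing the speech signal independently in each pitch period, without being influenced by the adjacent pitch cycles. In this work, the instants of significant excitation are used as pitch markers to perform the pitch synchronous analysis. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations, like the onset of a burst, in the case of nonvoiced speech. Instants of significant excitation are computed from the linear prediction (LP) residual of speech signals by using the property of the average group-delay of minimum phase signals. Line spectral frequencies (LSFs) are used for representing the vocal tract characteristics and for developing the associated mapping function. The LP residual of the speech signal is viewed as the excitation source, and the residual samples around the instant of glottal closure are used for mapping. Prosodic parameters at the syllable and phrase levels are used for deriving the prosodic mapping function. Source and system level mapping functions are derived pitch synchronously, and the incorporation of target prosodic parameters is performed pitch synchronously using the instants of significant excitation. The performance of the voice conversion system is evaluated using listening tests. The prediction accuracy of the mapping functions (neural network models) used at different levels in the proposed voice conversion system is further evaluated using objective measures such as deviation ($D_i$), root mean square error ($\mu_{RMSE}$), and correlation coefficient ($\gamma_{X,Y}$). The proposed approach (i.e., mapping and modification of parameters using the pitch synchronous approach) is shown to perform better than the earlier method (mapping the vocal tract parameters using block processing) proposed by the author.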
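The three objective measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give their formulas, so the definitions here are assumptions: deviation is taken as the absolute error relative to the target value in percent, RMSE is the plain root mean square error, and the correlation coefficient is the Pearson coefficient. The function name `objective_measures` and the sample values are hypothetical.

```python
import numpy as np

def objective_measures(target, predicted):
    """Return (per-item percentage deviation, RMSE, Pearson correlation).

    All three definitions are assumptions made for illustration; the
    abstract names the measures but does not define them.
    """
    x = np.asarray(target, dtype=float)
    y = np.asarray(predicted, dtype=float)
    # Assumed deviation: absolute error relative to the target, in percent.
    deviation = np.abs(x - y) / np.abs(x) * 100.0
    # Root mean square error between target and predicted values.
    rmse = np.sqrt(np.mean((x - y) ** 2))
    # Pearson correlation coefficient between target and predicted values.
    corr = np.corrcoef(x, y)[0, 1]
    return deviation, rmse, corr

# Hypothetical target and predicted parameter values (e.g. durations in ms).
target = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
deviation, rmse, corr = objective_measures(target, predicted)
print(deviation, rmse, corr)
```

A low deviation and RMSE together with a correlation close to 1 would indicate that a mapping function predicts the target speaker's parameters closely.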