Authors: Quan Wang, Yutian Chen, Jason Pelecanos, Yiling Huang
DOI: 10.1109/SLT48900.2021.9383525
Keywords:
Abstract: In recent years, Text-To-Speech (TTS) has been used as a data augmentation technique for speech recognition to help complement inadequacies in the training data. Correspondingly, we investigate the use of a multi-speaker TTS system to synthesize speech in support of speaker recognition. In this study, we focus the analysis on tasks where a relatively small number of speakers is available for training. We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance and can be combined effectively with multi-style training. Additionally, we explore the effectiveness of different types of text transcripts used for TTS synthesis. Results suggest that matching the textual content of the target domain is a good practice, and if that is not feasible, a transcript with a sufficiently large vocabulary is recommended.