Synth2Aug: Cross-Domain Speaker Recognition with TTS Synthesized Speech

Authors: Quan Wang, Yutian Chen, Jason Pelecanos, Yiling Huang

DOI: 10.1109/SLT48900.2021.9383525

Abstract: In recent years, Text-To-Speech (TTS) has been used as a data augmentation technique for speech recognition to help complement inadequacies in the training data. Correspondingly, we investigate the use of a multi-speaker TTS system to synthesize speech in support of speaker recognition. In this study, we focus the analysis on tasks where a relatively small number of speakers is available for training. We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance and can be combined effectively with multi-style training. Additionally, we explore the effectiveness of different types of text transcripts used for TTS synthesis. Results suggest that matching the textual content of the target domain is a good practice, and if that is not feasible, a transcript with a sufficiently large vocabulary is recommended.
