Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Authors: Ju Zhang, Li Liu, Jianrong Wang, Mei Yu, Qiang Fang

DOI: 10.1109/ICPR48806.2021.9413218

Keywords: Context (language use), Speech recognition, Face (geometry), Feature (machine learning), Motion (physics), Speaker recognition, Complement (set theory), Biometrics, Computer science, Artificial intelligence, Facial recognition system

Abstract: Lip motion reflects behavioral characteristics of speakers and can therefore be used as a new kind of biometrics in speaker recognition. In the literature, many works use two-dimensional (2D) lip images to recognize speakers in a text-dependent context. However, 2D lip images suffer easily from variation in face orientation. To this end, in this work we present a novel end-to-end 3D Lip Motion Network (3LMNet) that exploits sentence-level 3D lip motion (S3DLM) to recognize speakers in both text-independent and text-dependent contexts. A regional feedback module (RFM) is proposed to obtain attention over different lip regions. In addition, prior knowledge is investigated to complement the RFM, where landmark-level and frame-level features are merged to form a better feature representation. Moreover, two pre-processing methods, i.e., coordinate transformation and posture correction, are applied to the LSD-AV dataset, which contains 68 speakers with 146 sentences per speaker. The evaluation results on this dataset demonstrate that our 3LMNet is superior to the baseline models LSTM, VGG-16 and ResNet-34, and outperforms the state-of-the-art methods using 2D lip images as well as the 3D face. The code of this work is released at https://github.com/wutong18/Three-Dimensional-Lip-Motion-Network-for-Text-Independent-Speaker-Recognition.
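To make the described feature merging concrete, the following is a minimal sketch, assuming PyTorch, of how landmark-level (regional) and frame-level (sequence) lip-motion features could be fused under a simple regional attention. It is not the released 3LMNet implementation (see the repository linked above); the module names, feature dimensions, and the use of a GRU frame encoder are illustrative assumptions.

```python
# Hypothetical sketch: fuse frame-level and landmark-level lip-motion features
# with a simple regional attention before speaker classification.
# Not the authors' 3LMNet code; all names and dimensions are illustrative.
import torch
import torch.nn as nn


class RegionalAttention(nn.Module):
    """Weights lip regions (groups of 3D landmarks) before pooling them."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one scalar score per region

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, n_regions, feat_dim)
        weights = torch.softmax(self.score(region_feats), dim=1)  # (B, R, 1)
        return (weights * region_feats).sum(dim=1)                # (B, feat_dim)


class LipMotionFusion(nn.Module):
    """Merges frame-level (temporal) and landmark-level (regional) features."""
    def __init__(self, feat_dim: int = 128, n_speakers: int = 68):
        super().__init__()
        self.frame_encoder = nn.GRU(input_size=feat_dim, hidden_size=feat_dim,
                                    batch_first=True)
        self.regional_attn = RegionalAttention(feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, n_speakers)

    def forward(self, frame_feats: torch.Tensor,
                region_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats:  (B, T, feat_dim)  per-frame lip descriptors
        # region_feats: (B, R, feat_dim)  per-region landmark descriptors
        _, h = self.frame_encoder(frame_feats)           # h: (1, B, feat_dim)
        fused = torch.cat([h.squeeze(0),
                           self.regional_attn(region_feats)], dim=-1)
        return self.classifier(fused)                    # speaker logits


if __name__ == "__main__":
    model = LipMotionFusion()
    logits = model(torch.randn(2, 50, 128), torch.randn(2, 4, 128))
    print(logits.shape)  # torch.Size([2, 68])
```

The design choice illustrated here is that the regional branch lets the network emphasize lip regions that move distinctively for a given speaker, while the temporal branch summarizes the whole utterance, matching the paper's idea of complementing frame-level dynamics with landmark-level structure.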

References (32)
Scott A. King, Richard E. Parent, Barbara L. Olsafsky, "A Muscle-Based 3D Parametric Lip Model for Speech-Synchronized Facial Animation," Proceedings of the IFIP TC5/WG5.10 DEFORM'2000 Workshop and AVATARS'2000 Workshop on Deformable Avatars, pp. 12-23 (2000), 10.1007/978-0-306-47002-8_2
Roland Auckenthaler, John S. Mason, Jason Brand, "Lip Features for Speech and Speaker Recognition," European Signal Processing Conference, pp. 1-4 (1998), 10.5281/ZENODO.36780
Karen Simonyan, Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," Computer Vision and Pattern Recognition (2014)
Zheng-Yu Zhu, Qian-Hua He, Xiao-Hui Feng, Yan-Xiong Li, Zhi-Feng Wang, "Liveness Detection Using Time Drift Between Lip Movement and Voice," International Conference on Machine Learning and Cybernetics, vol. 02, pp. 973-978 (2013), 10.1109/ICMLC.2013.6890423
Tatsuya Nakata, Masayuki Kashima, Kiminori Sato, Mutsumi Watanabe, "Lip-Sync Personal Authentication System Using Movement Feature of Lip," International Conference on Biometrics, pp. 273-276 (2013), 10.1109/ICBAKE.2013.53
Yingjie Meng, Yingjie Hu, Haiyan Zhang, Xianlong Wang, "Speaker Identification in Time-Sequential Images Based on Movements of Lips," Fuzzy Systems and Knowledge Discovery, vol. 3, pp. 1729-1733 (2011), 10.1109/FSKD.2011.6019829
Xin Liu, Yiu-ming Cheung, "Learning Multi-Boosted HMMs for Lip-Password Based Speaker Verification," IEEE Transactions on Information Forensics and Security, vol. 9, pp. 233-246 (2014), 10.1109/TIFS.2013.2293025
Sepp Hochreiter, Jürgen Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, pp. 1735-1780 (1997), 10.1162/NECO.1997.9.8.1735
Vikas Reddy Aravabhumi, Rahul Reddy Chenna, Kurli Umashankar Reddy, "Robust Method to Identify the Speaker Using Lip Motion Features," International Conference on Mechanical and Electrical Technology, pp. 125-129 (2010), 10.1109/ICMET.2010.5598333
Girija Chetty, "Biometric Liveness Detection Based on Cross Modal Fusion," International Conference on Information Fusion, pp. 2255-2262 (2009)