Authors: Ju Zhang, Li Liu, Jianrong Wang, Mei Yu, Qiang Fang
DOI: 10.1109/ICPR48806.2021.9413218
Keywords: Context (language use), Speech recognition, Face (geometry), Feature (machine learning), Motion (physics), Speaker recognition, Complement (set theory), Biometrics, Computer science, Artificial intelligence, Facial recognition system
Abstract: Lip motion reflects behavioral characteristics of speakers, and thus can be used as a new kind of biometrics in speaker recognition. In the literature, many works use two-dimensional (2D) lip images to recognize speakers in a text-dependent context. However, 2D lip motion easily suffers from various face orientations. To this end, in this work, we present a novel end-to-end 3D Lip Motion Network (3LMNet) that utilizes sentence-level 3D lip motion (S3DLM) to recognize speakers in both text-independent and text-dependent contexts. A regional feedback module (RFM) is proposed to obtain attentions over different lip regions. Besides, prior knowledge is investigated to complement the RFM, where landmark-level and frame-level features are merged to form a better feature representation. Moreover, two methods, i.e., coordinate transformation and face posture correction, are used to pre-process the LSD-AV dataset, which contains 68 speakers and 146 sentences per speaker. The evaluation results on this dataset demonstrate that our 3LMNet is superior to the baseline models, i.e., LSTM, VGG-16 and ResNet-34, and outperforms the state-of-the-art methods using 2D lip images as well as the 3D face. The code of this work is released at https://github.com/wutong18/Three-Dimensional-Lip-Motion-Network-for-Text-Independent-Speaker-Recognition.
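To make the abstract's pipeline concrete, below is a minimal sketch, assuming PyTorch, of how sentence-level 3D lip landmarks could flow through a regional attention module and a merge of landmark-level and frame-level features before speaker classification. It is not the authors' released implementation; the class names (`RegionalFeedback`, `LipMotionNet`), the landmark/region counts, and all layer sizes are illustrative assumptions.

```python
# Illustrative sketch only; shapes, region splits, and layer sizes are assumptions,
# not the official 3LMNet code.
import torch
import torch.nn as nn


class RegionalFeedback(nn.Module):
    """Hypothetical regional attention: weight groups of lip landmarks per frame."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, region_feats):                      # (B, T, R, D)
        attn = torch.softmax(self.score(region_feats), dim=2)  # attention over regions
        return (attn * region_feats).sum(dim=2)                # (B, T, D)


class LipMotionNet(nn.Module):
    """Sentence-level 3D lip motion -> speaker logits.

    Input: (B, T, L, 3) 3D lip landmarks per frame; L landmarks split into R regions.
    """

    def __init__(self, num_landmarks=20, num_regions=4, hidden=128, num_speakers=68):
        super().__init__()
        assert num_landmarks % num_regions == 0
        per_region = num_landmarks // num_regions
        self.num_regions = num_regions
        self.landmark_encoder = nn.Linear(per_region * 3, hidden)  # landmark-level features
        self.frame_encoder = nn.Linear(num_landmarks * 3, hidden)  # frame-level features
        self.rfm = RegionalFeedback(hidden)
        self.temporal = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_speakers)

    def forward(self, lips):                                    # (B, T, L, 3)
        b, t, l, _ = lips.shape
        regions = lips.reshape(b, t, self.num_regions, -1)      # (B, T, R, per_region*3)
        landmark_feat = self.rfm(self.landmark_encoder(regions))  # (B, T, hidden)
        frame_feat = self.frame_encoder(lips.reshape(b, t, -1))   # (B, T, hidden)
        fused = torch.cat([landmark_feat, frame_feat], dim=-1)    # merge both feature levels
        _, h = self.temporal(fused)                               # sentence-level summary
        return self.classifier(h[-1])                             # (B, num_speakers)


if __name__ == "__main__":
    dummy = torch.randn(2, 50, 20, 3)    # 2 sequences, 50 frames, 20 landmarks, xyz
    print(LipMotionNet()(dummy).shape)   # torch.Size([2, 68])
```

The speaker count of 68 here mirrors the LSD-AV dataset size mentioned in the abstract; everything else (GRU temporal model, number of landmarks, region grouping) is a placeholder choice to show where the RFM attention and the landmark/frame feature merge would sit in such a network.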