Emotions Don't Lie: An Audio-Visual Deepfake Detection Method using Affective Cues

Authors: Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

DOI: 10.1145/3394171.3413570

Keywords: Computer science, Audio visual, Similarity (psychology), Speech recognition, Artificial intelligence, Affective computing, Network architecture, Modalities, Metric (mathematics), Deep learning, Triplet loss

Abstract: We present a learning-based method for detecting real and fake deepfake multimedia content. To maximize information for learning, we extract and analyze the similarity between the two audio and visual modalities from within the same video. Additionally, we extract and compare affective cues corresponding to perceived emotion from the two modalities within a video to infer whether the input video is "real" or "fake". We propose a deep learning network, inspired by the Siamese network architecture and the triplet loss. To validate our model, we report the AUC metric on two large-scale deepfake detection datasets, the DeepFake-TIMIT Dataset and DFDC. We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on DFDC and 96.6% on DF-TIMIT, respectively. To the best of our knowledge, ours is the first approach that simultaneously exploits the audio and video modalities and also the perceived emotions from the two modalities for deepfake detection.
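
To make the abstract's central idea concrete, below is a minimal sketch (not the authors' implementation) of a Siamese-style setup trained with a triplet loss: audio and visual encoders map pre-extracted per-modality features into a shared space, where matching modalities from a real video should embed closer together than a tampered modality. All feature dimensions, layer sizes, and the margin are illustrative assumptions.

```python
# Hypothetical sketch of cross-modal triplet training for deepfake detection.
# Assumes per-modality features (e.g., face crops, audio frames) are already
# extracted; dimensions and architecture are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Maps one modality's features into a shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so audio/visual distances are comparable.
        return F.normalize(self.net(x), dim=-1)

# Hypothetical feature sizes for the two modalities.
visual_enc = ModalityEncoder(in_dim=512)
audio_enc = ModalityEncoder(in_dim=256)
triplet = nn.TripletMarginLoss(margin=0.5)

# Toy batch: visual features of a real video (anchor), its genuine audio
# (positive), and audio from a faked version of the video (negative).
v_real = visual_enc(torch.randn(8, 512))
a_real = audio_enc(torch.randn(8, 256))
a_fake = audio_enc(torch.randn(8, 256))

loss = triplet(v_real, a_real, a_fake)
loss.backward()
print(f"triplet loss: {loss.item():.4f}")
```

At inference, one plausible use of such embeddings is to threshold the audio-visual distance for a video to label it "real" or "fake"; per the abstract, the method additionally compares perceived-emotion cues extracted from each modality.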

References (40)
Karen S. Quigley, Kristen A. Lindquist, Lisa Feldman Barrett, "Inducing and Measuring Emotion and Affect," Handbook of Research Methods in Social and Personality Psychology, pp. 220–252 (2014). DOI: 10.1017/CBO9780511996481.014
Maja Pantic, Nicu Sebe, Jeffrey F. Cohn, Thomas Huang, "Affective multimodal human-computer interaction," Proceedings of the 13th Annual ACM International Conference on Multimedia (MULTIMEDIA '05), pp. 669–676 (2005). DOI: 10.1145/1101149.1101299
Paul Ekman, Wallace V. Friesen, Sonia Ancoli, "Facial signs of emotional experience," Journal of Personality and Social Psychology, vol. 39, pp. 1125–1134 (1980). DOI: 10.1037/H0077722
Hatice Gunes, Massimo Piccardi, "Bi-modal emotion recognition from expressive face and body gestures," Journal of Network and Computer Applications, vol. 30, pp. 1334–1345 (2007). DOI: 10.1016/J.JNCA.2006.09.007
Mihai Gurban, Jean-Philippe Thiran, Thomas Drugman, Thierry Dutoit, "Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition," Proceedings of the 10th International Conference on Multimodal Interfaces (ICMI '08), pp. 237–240 (2008). DOI: 10.1145/1452392.1452442
L. A. Ross, D. Saint-Amour, V. M. Leavitt, D. C. Javitt, J. J. Foxe, "Do You See What I Am Saying? Exploring Visual Enhancement of Speech Comprehension in Noisy Environments," Cerebral Cortex, vol. 17, pp. 1147–1153 (2006). DOI: 10.1093/CERCOR/BHL024
Jean-Luc Schwartz, Frédéric Berthommier, Christophe Savariaux, "Seeing to hear better: evidence for early audio-visual interactions in speech identification," Cognition, vol. 93 (2004). DOI: 10.1016/J.COGNITION.2004.01.006
C. Shan, S. Gong, P. W. McOwan, "Beyond Facial Expressions: Learning Human Emotion from Body Gestures," British Machine Vision Conference, pp. 1–10 (2007). DOI: 10.5244/C.21.43