Combining video, audio and lexical indicators of affect in spontaneous conversation via particle filtering

Authors: Arman Savran, Houwei Cao, Miraj Shah, Ani Nenkova, Ragini Verma

DOI: 10.1145/2388676.2388781

Abstract: We present experiments on fusing facial video, audio and lexical indicators for affect estimation during dyadic conversations. We use temporal statistics of texture descriptors extracted from facial video, a combination of various acoustic features, and lexical features to create regression-based affect estimators for each modality. The single-modality regressors are then combined using particle filtering, by treating their independent outputs as measurements of the affect states in a Bayesian filtering framework, where previous observations provide a prediction about the current state by means of learned affect dynamics. Tested on the Audio-visual Emotion Recognition Challenge dataset, our estimators achieve substantially higher scores than the official baseline method for every dimension of affect. Our filtering-based multi-modality fusion achieves correlation performance of 0.344 (baseline: 0.136) and 0.280 (baseline: 0.096) for the fully continuous and word-level sub-challenges, respectively.
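The fusion scheme the abstract describes — treating each single-modality regressor's output as an independent measurement of the affect state inside a particle filter with learned dynamics — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the AR(1) dynamics coefficient, the process and per-modality measurement noise levels, and the example measurement values are all hypothetical placeholders standing in for quantities the paper learns from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned dynamics and noise levels (illustrative only):
# state transition x_t = A * x_{t-1} + process noise.
A, PROC_STD = 0.95, 0.1
MEAS_STD = {"video": 0.4, "audio": 0.3, "lexical": 0.5}

def particle_filter(measurements, n_particles=500):
    """Fuse per-modality regressor outputs over time.

    measurements: list of dicts mapping modality -> scalar estimate,
    one dict per frame. Returns the per-frame posterior mean of the
    affect state.
    """
    particles = rng.normal(0.0, 1.0, n_particles)
    estimates = []
    for z in measurements:
        # Predict: propagate particles through the learned dynamics.
        particles = A * particles + rng.normal(0.0, PROC_STD, n_particles)
        # Update: weight each particle by the product of the independent
        # Gaussian likelihoods of the modality measurements.
        log_w = np.zeros(n_particles)
        for mod, value in z.items():
            log_w += -0.5 * ((value - particles) / MEAS_STD[mod]) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates.append(float(np.dot(w, particles)))
        # Resample to avoid weight degeneracy.
        particles = rng.choice(particles, n_particles, p=w)
    return estimates

# Example: three frames of single-modality regression outputs
# (made-up values for illustration).
frames = [
    {"video": 0.2, "audio": 0.3, "lexical": 0.1},
    {"video": 0.4, "audio": 0.5, "lexical": 0.3},
    {"video": 0.5, "audio": 0.6, "lexical": 0.4},
]
est = particle_filter(frames)
```

The design choice the paper highlights is that each regressor stays independent; the filter is the only place the modalities interact, via the product of their measurement likelihoods and the temporal smoothing supplied by the learned dynamics.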
