Authors: Arman Savran, Houwei Cao, Miraj Shah, Ani Nenkova, Ragini Verma
Keywords:
Abstract: We present experiments on fusing facial video, audio and lexical indicators for affect estimation during dyadic conversations. We use temporal statistics of texture descriptors extracted from facial video, a combination of various acoustic features, and lexical features to create regression-based affect estimators for each modality. The single-modality regressors are then combined using particle filtering, by treating these independent regression outputs as measurements of the affect states in a Bayesian filtering framework, where previous observations provide a prediction of the current state by means of learned affect dynamics. Tested on the Audio-visual Emotion Recognition Challenge dataset, our single-modality estimators achieve substantially higher scores than the official baseline method for every dimension of affect. Our filtering-based multi-modality fusion achieves correlation performance of 0.344 (baseline: 0.136) and 0.280 (baseline: 0.096) on the fully continuous and word-level sub-challenges, respectively.
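As a rough illustration of the fusion step described in the abstract, the sketch below shows a generic particle filter that treats per-modality regression outputs as conditionally independent measurements of a single affect dimension, with a simple AR(1) model standing in for the learned affect dynamics. The function name, transition coefficient, and noise levels are illustrative assumptions, not the paper's actual parameters or implementation.

```python
import numpy as np

def particle_filter_fusion(measurements, transition_coef=0.9, process_std=0.1,
                           measurement_stds=(0.3, 0.3, 0.3), n_particles=500, seed=0):
    """Fuse per-modality affect estimates over time with a particle filter.

    measurements: array of shape (T, M) -- at each frame, M single-modality
    regression outputs for one affect dimension (e.g. video, audio, lexical).
    Returns the filtered affect trajectory of length T.
    (Dynamics and noise parameters here are placeholders, not learned values.)
    """
    rng = np.random.default_rng(seed)
    T, M = measurements.shape
    particles = rng.normal(0.0, 1.0, n_particles)   # initial state hypotheses
    estimates = np.empty(T)

    for t in range(T):
        # Prediction: propagate particles through the (assumed AR(1)) dynamics.
        particles = transition_coef * particles + rng.normal(0.0, process_std, n_particles)

        # Update: weight particles by the likelihood of each modality's output,
        # treating the M regressor outputs as independent noisy measurements.
        log_w = np.zeros(n_particles)
        for m in range(M):
            log_w += -0.5 * ((measurements[t, m] - particles) / measurement_stds[m]) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()

        estimates[t] = np.sum(w * particles)        # posterior mean estimate

        # Resample to avoid weight degeneracy.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]

    return estimates
```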