Neural Message Passing on Hybrid Spatio-Temporal Visual and Symbolic Graphs for Video Understanding

Authors: René Vidal, Benjamín Béjar Haro, Effrosyni Mavroudi

Abstract: Many problems in video understanding require labeling multiple activities occurring concurrently in different parts of a video, including the objects and actors participating in such activities. However, state-of-the-art methods in computer vision focus primarily on tasks such as action classification, action detection, or action segmentation, where typically only one action label needs to be predicted. In this work, we propose a generic approach to classifying one or more nodes of a spatio-temporal graph grounded on spatially localized semantic entities in a video, such as actors and objects. In particular, we combine an attributed spatio-temporal visual graph, which captures visual context and interactions, with an attributed symbolic graph grounded on the semantic label space, which captures relationships between multiple labels. We further propose a neural message passing framework for jointly refining the representations of the nodes and edges of the hybrid visual-symbolic graph. Our framework features a) node-type and edge-type conditioned filters and adaptive graph connectivity, b) a soft-assignment module for connecting visual nodes to symbolic nodes and vice versa, c) a symbolic graph reasoning module that enforces semantic coherence, and d) a pooling module for aggregating the refined node and edge representations for downstream classification tasks. We demonstrate the generality of our approach on a variety of tasks, such as temporal subactivity classification and object affordance classification on the CAD-120 dataset and multilabel temporal action localization on the large-scale Charades dataset, where we outperform existing deep learning approaches, using only raw RGB frames.
