Authors: René Vidal, Benjamín Béjar Haro, Effrosyni Mavroudi
DOI:
Keywords:
Abstract: Many problems in video understanding require labeling multiple activities occurring concurrently in different parts of a video, including the objects and actors participating in such activities. However, state-of-the-art methods in computer vision focus primarily on tasks such as action classification, action detection, or action segmentation, where typically only one action label needs to be predicted. In this work, we propose a generic approach to classifying one or more nodes of a spatio-temporal graph grounded on spatially localized semantic entities in a video, such as actors and objects. In particular, we combine an attributed spatio-temporal visual graph, which captures visual context and interactions, with an attributed symbolic graph grounded on the semantic label space, which captures relationships between multiple labels. We further propose a neural message passing framework for jointly refining the representations of the nodes and edges of the hybrid visual-symbolic graph. Our framework features a) node-type and edge-type conditioned filters and adaptive graph connectivity, b) a soft-assignment module for connecting visual nodes to symbolic nodes and vice versa, c) a symbolic graph reasoning module that enforces semantic coherence, and d) a pooling module for aggregating the refined node and edge representations for downstream classification tasks. We demonstrate the generality of our approach on a variety of tasks, such as temporal subactivity classification and object affordance classification on the CAD-120 dataset and multilabel temporal action localization on the large-scale Charades dataset, where we outperform existing deep learning approaches, using only raw RGB frames.
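The abstract names a soft-assignment module that connects visual nodes to symbolic nodes and back, but does not say how it is computed. The following is a minimal sketch of one plausible form, not the authors' implementation: it assumes dot-product similarity followed by a softmax over symbolic nodes, weighted sums in both directions, and a residual addition when mapping back. All function names and the toy feature values are hypothetical, and the symbolic graph reasoning step that would sit between the two mappings is omitted.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_assign(visual, symbolic):
    # Assumption: similarity is a plain dot product.
    # Row i is a distribution of visual node i over the symbolic nodes.
    return [softmax([dot(x, s) for s in symbolic]) for x in visual]

def visual_to_symbolic(visual, assign, num_symbolic):
    # Each symbolic node receives an assignment-weighted sum of visual features.
    dim = len(visual[0])
    out = [[0.0] * dim for _ in range(num_symbolic)]
    for x, weights in zip(visual, assign):
        for k, w in enumerate(weights):
            for d in range(dim):
                out[k][d] += w * x[d]
    return out

def symbolic_to_visual(visual, symbolic, assign):
    # Assumption: symbolic context is added back to each visual node residually.
    refined = []
    for x, weights in zip(visual, assign):
        context = [sum(w * s[d] for w, s in zip(weights, symbolic))
                   for d in range(len(x))]
        refined.append([a + b for a, b in zip(x, context)])
    return refined

# Toy example: three visual nodes, two symbolic (label) nodes, 2-D features.
visual_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
symbolic_nodes = [[2.0, 0.0], [0.0, 2.0]]

assignment = soft_assign(visual_nodes, symbolic_nodes)
symbolic_inputs = visual_to_symbolic(visual_nodes, assignment, len(symbolic_nodes))
refined_nodes = symbolic_to_visual(visual_nodes, symbolic_nodes, assignment)
```

In this toy setup the third visual node is equally similar to both symbolic nodes, so its assignment is split evenly between them, while the first node leans toward the first symbolic node.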