Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering

Authors: Arun Mallya, Svetlana Lazebnik

DOI: 10.1007/978-3-319-46448-0_25

Keywords: Computer science; Learning models; Context (language use); Multiple choice; Task (project management); Network model; Question answering; Transfer (computing); Machine learning; Object (computer science); Artificial intelligence

Abstract: This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision on the level of individual person instances, and weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions on person activity and person-object relationship and show improvements over generic features trained on the ImageNet classification task.
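The two training ideas named in the abstract, multiple instance learning over person instances and a weighted loss for unbalanced labels, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a max over instances as the aggregation rule, the per-label sigmoid cross-entropy, and the weight values are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mil_image_scores(instance_scores):
    """Aggregate per-person label scores into image-level scores.

    instance_scores: one list of raw label scores per detected person.
    Assumption: the image-level score for a label is the maximum over
    person instances, so only image-level labels are needed for training.
    """
    num_labels = len(instance_scores[0])
    return [max(person[k] for person in instance_scores)
            for k in range(num_labels)]


def weighted_bce(scores, targets, pos_weight=10.0, neg_weight=1.0):
    """Mean per-label sigmoid cross-entropy with separate class weights.

    pos_weight and neg_weight are hypothetical values; a larger
    pos_weight penalizes missed positives more, which counteracts
    labels that are rarely present in the training data.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for s, y in zip(scores, targets):
        p = sigmoid(s)
        total -= (pos_weight * y * math.log(p + eps)
                  + neg_weight * (1 - y) * math.log(1.0 - p + eps))
    return total / len(scores)


# Two person instances, three activity labels (made-up scores).
instance_scores = [[2.0, -1.0, -3.0],
                   [-0.5, 0.5, -2.0]]
image_scores = mil_image_scores(instance_scores)
loss = weighted_bce(image_scores, targets=[1, 0, 0])
```

Because the loss is computed on the aggregated scores, a positive image-level label only requires that at least one person instance scores highly for it, which is what makes instance-level annotation unnecessary in this sketch.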
