Stacked Latent Attention for Multimodal Reasoning

Authors: Haoqi Fan, Jiatong Zhou

DOI: 10.1109/CVPR.2018.00118

Keywords: Knowledge extraction, Cognition, Machine learning, Multimodal learning, Deep learning, Question answering, Task analysis, Spatial intelligence, Closed captioning, Computer science, Artificial intelligence

Abstract: Attention has been shown to be a pivotal development in deep learning and has been used for a multitude of multimodal learning tasks such as visual question answering and image captioning. In this work, we pinpoint the potential limitations in the design of a traditional attention model. We identify that 1) current attention mechanisms discard the latent information from intermediate reasoning, losing the positional information already captured by the attention heatmaps, and 2) stacked attention, a common way to improve spatial reasoning, may have suboptimal performance because of the vanishing gradient problem. We introduce a novel attention architecture to address these problems, in which all spatial configuration information contained in the intermediate reasoning process is retained in a pathway of convolutional layers. We show that the new attention leads to substantial improvements in multiple multimodal reasoning tasks, including single-model performance, without using external knowledge, comparable to the state of the art on the VQA dataset, as well as clear gains on the image captioning task.
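To make the first limitation concrete, the following is a minimal sketch of generic soft attention over image regions. It is not the architecture proposed by the authors. The function names (`softmax`, `soft_attention`), the dot-product scoring, and the toy feature values are assumptions chosen for illustration. The point it shows is that the per-region weights (the heatmap) carry positional information, while the returned context vector is a single pooled summary in which that layout is no longer present.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(regions, query):
    """Generic soft attention (illustrative, not the paper's model).

    regions: list of K feature vectors, one per spatial location.
    query:   feature vector of the same dimension (e.g. a question encoding).
    Returns (weights, context): K attention weights and one pooled vector.
    """
    # Dot-product score per region (an assumed scoring function).
    scores = [sum(r * q for r, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    dim = len(query)
    # The weighted sum collapses all K locations into one vector.
    context = [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(dim)
    ]
    return weights, context

# Toy example: 4 regions with 2-dimensional features (hypothetical values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
query = [2.0, 0.0]
weights, context = soft_attention(regions, query)
```

Here `weights` has one entry per region, so it still records where the model attended, whereas `context` has only the feature dimension. A later layer that receives only `context` cannot recover which regions produced it, which is the loss of positional information the abstract describes.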
