Authors: Haoqi Fan, Jiatong Zhou
Keywords: Knowledge extraction, Cognition, Machine learning, Multimodal learning, Deep learning, Question answering, Task analysis, Spatial intelligence, Closed captioning, Computer science, Artificial intelligence
Abstract: Attention has proven to be a pivotal development in deep learning and has been used for a multitude of multimodal tasks such as visual question answering and image captioning. In this work, we pinpoint potential limitations in the design of the traditional attention model. We identify that 1) current attention mechanisms discard the latent information from intermediate reasoning, losing the positional information already captured in the attention heatmaps, and 2) stacked attention, a common way to improve spatial reasoning, may have suboptimal performance because of the vanishing gradient problem. We introduce a novel attention architecture to address these problems, in which all spatial configuration information contained in the intermediate reasoning process is retained in a pathway of convolutional layers. We show that this new attention leads to substantial improvements on multiple multimodal reasoning tasks, including single-model performance comparable to the state-of-the-art on the VQA dataset without using external knowledge, as well as clear gains on the image captioning task.
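To make the abstract's core idea concrete, below is a minimal NumPy sketch of stacked attention in which each hop's spatial heatmap is retained instead of being discarded after the weighted sum. The function names, dimensions, and the way the retained heatmaps are returned are illustrative assumptions, not the authors' implementation (which routes the latent information through convolutional layers).

```python
# Illustrative sketch only: stacked attention that keeps intermediate heatmaps.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_hop(features, query, w_f, w_q, w_a):
    """One attention hop over spatial regions.

    features: (num_regions, d) per-region visual features
    query:    (d,)             current reasoning state
    Returns the attended context vector and the spatial heatmap.
    """
    hidden = np.tanh(features @ w_f + query @ w_q)  # (num_regions, d_hidden)
    logits = hidden @ w_a                           # (num_regions,)
    heatmap = softmax(logits)                       # spatial attention weights
    context = heatmap @ features                    # (d,)
    return context, heatmap

def stacked_attention_with_latents(features, query, params):
    """Run several attention hops, refining the query at each hop and
    keeping every intermediate heatmap rather than discarding it."""
    heatmaps = []
    for w_f, w_q, w_a in params:
        context, heatmap = attention_hop(features, query, w_f, w_q, w_a)
        query = query + context          # refine the reasoning state
        heatmaps.append(heatmap)
    # The stacked heatmaps carry the positional evidence from intermediate
    # reasoning; the paper feeds this kind of latent information to later
    # layers instead of throwing it away.
    return query, np.stack(heatmaps)     # (hops, num_regions)

# Toy usage: 49 regions (a 7x7 grid flattened), 8-dim features, 2 hops.
d, d_hidden, regions, hops = 8, 16, 49, 2
params = [(rng.normal(size=(d, d_hidden)) * 0.1,
           rng.normal(size=(d, d_hidden)) * 0.1,
           rng.normal(size=(d_hidden,)) * 0.1) for _ in range(hops)]
features = rng.normal(size=(regions, d))
query = rng.normal(size=(d,))
answer_state, latent_heatmaps = stacked_attention_with_latents(features, query, params)
print(answer_state.shape, latent_heatmaps.shape)   # (8,) (2, 49)
```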