Supportive and Self Attentions for Image Caption

Authors: Jen-Tzung Chien, Ting-An Lin

DOI:

Keywords:

Abstract: Attention over an observed image or natural sentence is run by spotting or locating the region or position of interest for pattern classification. The attention parameter is seen as a latent variable, which is indirectly calculated by minimizing the classification loss. Using such a mechanism, the target information may not be correctly identified. Therefore, in addition to the classification error, we can directly attend to the reconstruction error due to the supporting data. Our idea is to learn how to attend through the so-called supportive attention when supporting data are available. A new attention mechanism is developed to conduct attentive learning with translation invariance, which is applied to image captioning. The supportive attention is derived and is helpful for generating a caption from an input image. Moreover, this paper presents an association network which does not only implement word-to-image attention, but also carries out image-to-image attention via self attention. The relations between image and text are thus sufficiently represented. Experiments on the MS-COCO caption task show the benefit of the proposed attentions with a key-value memory network.
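To make the idea concrete, below is a minimal PyTorch sketch of the two mechanisms the abstract describes: image-to-image self attention over region features, word-to-image attention conditioned on a language query, and an auxiliary reconstruction branch so the attention receives a direct training signal from the supporting data rather than only the indirect captioning loss. The class name, feature dimensions, loss weighting beta, and the choice of mean region feature as the reconstruction target are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupportiveAttention(nn.Module):
    """Sketch: self attention over image regions + word-to-image attention,
    with an extra reconstruction head ("supportive" signal). Hypothetical."""
    def __init__(self, region_dim=512, query_dim=256, vocab_size=10000):
        super().__init__()
        # image-to-image relations via self attention over region features
        self.self_attn = nn.MultiheadAttention(region_dim, num_heads=4,
                                               batch_first=True)
        # additive word-to-image attention score
        self.score = nn.Linear(region_dim + query_dim, 1)
        # next-word classifier (captioning loss path)
        self.decode = nn.Linear(region_dim, vocab_size)
        # reconstruction head (supportive loss path); an assumed design
        self.reconstruct = nn.Linear(region_dim, region_dim)

    def forward(self, regions, query):
        # regions: (B, R, region_dim) CNN region features; query: (B, query_dim)
        regions, _ = self.self_attn(regions, regions, regions)
        q = query.unsqueeze(1).expand(-1, regions.size(1), -1)
        # attention weights are the latent variable the abstract refers to
        alpha = torch.softmax(
            self.score(torch.cat([regions, q], dim=-1)).squeeze(-1), dim=-1)
        context = torch.bmm(alpha.unsqueeze(1), regions).squeeze(1)  # (B, D)
        return self.decode(context), self.reconstruct(context)

# Usage: the total loss combines the captioning (classification) error with a
# reconstruction error against the supporting data; beta is an assumed weight.
model = SupportiveAttention()
regions = torch.randn(2, 36, 512)            # e.g. 36 region features per image
query = torch.randn(2, 256)                  # language-side query vector
logits, recon = model(regions, query)
target_word = torch.randint(0, 10000, (2,))
loss = (F.cross_entropy(logits, target_word)
        + 0.5 * F.mse_loss(recon, regions.mean(dim=1)))
loss.backward()
```

Trained with only the cross-entropy term, the attention weights alpha are updated purely through the word prediction error; the added MSE term gives them a second, direct gradient path, which is the intuition behind the supportive attention described above.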

References (24)
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, Sanjeev Khudanpur, "Recurrent neural network based language model," Proc. INTERSPEECH, pp. 1045-1048, 2010.
Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio, "Attention-based models for speech recognition," Advances in Neural Information Processing Systems, vol. 28, pp. 577-585, 2015.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," Proc. International Conference on Machine Learning, pp. 2048-2057, 2015.
Karen Simonyan, Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, "End-to-end memory networks," Advances in Neural Information Processing Systems, vol. 28, pp. 2440-2448, 2015.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick, "Microsoft COCO: Common Objects in Context," Computer Vision - ECCV 2014, pp. 740-755, 2014. DOI: 10.1007/978-3-319-10602-1_48
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, C. Lawrence Zitnick, "Microsoft COCO Captions: Data Collection and Evaluation Server," arXiv preprint, 2015.
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, "Show and tell: A neural image caption generator," Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156-3164, 2015. DOI: 10.1109/CVPR.2015.7298935
Dong Yu, Geoffrey Hinton, Nelson Morgan, Jen-Tzung Chien, Shigeki Sagayama, "Introduction to the Special Section on Deep Learning for Speech and Language Processing," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, pp. 4-6, 2012. DOI: 10.1109/TASL.2011.2173371
Sepp Hochreiter, Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735-1780, 1997. DOI: 10.1162/neco.1997.9.8.1735