Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching

Authors: Chunxiao Liu, Zhendong Mao, An-An Liu, Tianzhu Zhang, Bin Wang

DOI: 10.1145/3343031.3350869

Abstract: Learning semantic correspondence between image and text is significant as it bridges the semantic gap between vision and language. The key challenge is to accurately find and correlate the shared semantics in image and text. Most existing methods achieve this goal by representing the shared semantic as a weighted combination of all the fragments (image regions or text words), where fragments relevant to the shared semantic obtain more attention and the others less. However, although relevant fragments contribute more to the shared semantic, irrelevant ones will more or less disturb it, and thus lead to semantic misalignment in the correlation phase. To address this issue, we present a novel Bidirectional Focal Attention Network (BFAN), which not only allows attending to relevant fragments but also diverts all the attention into these relevant fragments so as to concentrate on them. The main difference from existing works is that they mostly focus on learning the attention weight, while our BFAN focuses on eliminating irrelevant fragments from the shared semantic. The focal attention is achieved by preassigning attention based on the inter-modality relation, identifying relevant fragments based on the intra-modality relation, and reassigning attention. Furthermore, the focal attention is jointly applied in both image-to-text and text-to-image directions, which enables avoiding a preference for long text or complex images. Experiments show that our simple but effective framework significantly outperforms the state-of-the-art, with relative Recall@1 gains of 2.2% on both the Flickr30K and MSCOCO benchmarks.
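The three-step focal attention described in the abstract (preassign, identify, reassign) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the use of cosine similarity, softmax preassignment, and a mean-score threshold for identifying relevant fragments are all assumptions made for this sketch.

```python
import numpy as np

def focal_attention(query, fragments):
    """Hedged sketch of focal attention over one query.

    query: (d,) vector from one modality; fragments: (n, d) from the other.
    Steps (details here are illustrative assumptions):
      1. preassign attention from inter-modality similarity,
      2. identify relevant fragments by comparing each preassigned score
         against the mean score over all fragments,
      3. reassign attention over the surviving fragments only.
    """
    # 1) preassign: softmax over cosine similarities (inter-modality relation)
    sims = fragments @ query / (
        np.linalg.norm(fragments, axis=1) * np.linalg.norm(query) + 1e-8)
    pre = np.exp(sims - sims.max())
    pre /= pre.sum()

    # 2) identify: keep only fragments whose score exceeds the mean score,
    #    so irrelevant fragments are eliminated instead of merely down-weighted
    mask = (pre > pre.mean()).astype(float)

    # 3) reassign: renormalize the attention over the kept fragments
    post = pre * mask
    post /= post.sum() + 1e-8

    # the attended representation aggregates only the relevant fragments
    return post @ fragments, post
```

In the bidirectional setting, the same procedure would be applied with words attending over image regions and, symmetrically, with regions attending over words, and the two matching scores combined.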

References (28)
Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137-1149, 2017. doi:10.1109/TPAMI.2016.2577031
Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. ICCV 2015, pp. 2641-2649. doi:10.1109/ICCV.2015.303
Thang Luong, Hieu Pham, Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015, pp. 1412-1421. doi:10.18653/v1/D15-1166
Andrej Karpathy, Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015, pp. 3128-3137. doi:10.1109/CVPR.2015.7298932
Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li. Multimodal Convolutional Neural Networks for Matching Image and Sentence. ICCV 2015, pp. 2623-2631. doi:10.1109/ICCV.2015.301
Andrej Karpathy, Armand Joulin, Li Fei-Fei. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. Advances in Neural Information Processing Systems, vol. 27, pp. 1889-1897, 2014.
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473, 2014.
Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh. Hierarchical Question-Image Co-Attention for Visual Question Answering. arXiv preprint, 2016.
Hyeonseob Nam, Jung-Woo Ha, Jeonghee Kim. Dual Attention Networks for Multimodal Reasoning and Matching. CVPR 2017, pp. 2156-2164. doi:10.1109/CVPR.2017.232
Yan Huang, Wei Wang, Liang Wang. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. CVPR 2017, pp. 7254-7262. doi:10.1109/CVPR.2017.767