Authors: Chunxiao Liu, Zhendong Mao, An-An Liu, Tianzhu Zhang, Bin Wang
Keywords:
Abstract: Learning semantic correspondence between image and text is significant as it bridges the semantic gap between vision and language. The key challenge is to accurately find and correlate the shared semantics in image and text. Most existing methods achieve this goal by representing the shared semantic as a weighted combination of all the fragments (image regions or text words), where fragments relevant to the shared semantic obtain more attention and the others obtain less. However, although relevant fragments contribute more to the shared semantic, irrelevant ones still disturb it more or less, which leads to semantic misalignment in the correlation phase. To address this issue, we present a novel Bidirectional Focal Attention Network (BFAN), which not only allows attending to relevant fragments but also diverts all the attention into these relevant fragments to concentrate on them. The main difference from existing works is that they mostly focus on learning attention weights, while our BFAN focuses on eliminating irrelevant fragments from the shared semantic. The focal attention is achieved by pre-assigning attention based on the inter-modality relation, identifying relevant fragments based on the intra-modality relation, and reassigning attention. Furthermore, the focal attention is jointly applied in both image-to-text and text-to-image directions, which avoids a preference for long text or complex images. Experiments show that our simple but effective framework significantly outperforms the state-of-the-art, with relative Recall@1 gains of 2.2% on both the Flickr30K and MSCOCO benchmarks.
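The abstract outlines focal attention as three steps: pre-assign attention from the inter-modality relation, identify relevant fragments from the intra-modality relation, then reassign attention onto those fragments only. The sketch below is a minimal illustration of that pipeline under stated assumptions: the function name `focal_attention`, the single-query setup, and the mean-based relevance score are all hypothetical simplifications, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def focal_attention(query, fragments, eps=1e-8):
    """One direction of focal attention (a simplified sketch).

    query:     (d,)   one fragment embedding (e.g., a word)
    fragments: (n, d) fragment embeddings from the other modality
                      (e.g., image regions)
    """
    # Step 1: pre-assign attention from the inter-modality relation
    # (cosine similarity between the query and each fragment).
    sim = F.cosine_similarity(query.unsqueeze(0), fragments, dim=1)  # (n,)
    attn = F.softmax(sim, dim=0)                                     # (n,)

    # Step 2: score each fragment via an intra-modality comparison --
    # here, its pre-assigned attention relative to the mean over all
    # fragments (one simple, assumed scoring rule).
    score = attn - attn.mean()

    # Step 3: reassign -- zero out fragments scored as irrelevant and
    # renormalize, so all attention concentrates on the relevant ones.
    mask = (score > 0).float()
    focal = attn * mask
    focal = focal / (focal.sum() + eps)

    # Weighted combination over the relevant fragments only.
    return focal @ fragments  # (d,)
```

In the full bidirectional model, this procedure would run in both directions (each word attending over regions, and each region attending over words), and the per-fragment results would be aggregated into a similarity score for the image-text pair.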