Authors: Yunchao Wei, Xiaodan Liang, Fang Zhao, Jianshu Li, Tingfa Xu
Keywords:
Abstract: Retrieving image content with a natural language expression is an emerging interdisciplinary problem at the intersection of multimedia processing and artificial intelligence. Existing methods tackle this challenging problem by learning features from the visual and linguistic domains independently, while the critical semantic correlations bridging the two domains have been under-explored in the feature learning process. In this paper, we propose to exploit sharable attributes as "anchors" to ensure that the learned features are well aligned across domains for better object retrieval. We define "attributes" as common concepts that are informative for retrieval and can be easily obtained from both the image and the expression. In particular, diverse and complex attributes (e.g., location, color, category, and interaction between the object and its context) are modeled and incorporated to promote cross-domain alignment from multiple perspectives. Based on these attributes, a deep Attribute-Preserving Metric (AP-Metric) framework jointly generates unique query-sensitive region proposals and conducts novel cross-modal metric learning that explicitly pursues consistency over the attribute abstraction. Benefiting from these correlations, our proposed framework can accurately localize objects that match query expressions against cluttered backgrounds. The overall framework is end-to-end trainable. Extensive evaluations on popular datasets including ReferItGame, RefCOCO, and RefCOCO+ demonstrate its superiority. Notably, it achieves state-of-the-art performance on the ReferItGame dataset.