Deep Attribute-preserving Metric Learning for Natural Language Object Retrieval

Authors: Yunchao Wei, Xiaodan Liang, Fang Zhao, Jianshu Li, Tingfa Xu

DOI: 10.1145/3123266.3123439

Abstract: Retrieving image content with a natural language expression is an emerging interdisciplinary problem at the intersection of multimedia, natural language processing and artificial intelligence. Existing methods tackle this challenging problem by learning features from the visual and linguistic domains independently, while the critical semantic correlations bridging the two domains have been under-explored in the feature learning process. In this paper, we propose to exploit sharable semantic attributes as "anchors" to ensure that the learned features are well aligned across domains for better object retrieval. We define "attributes" as the common concepts that are informative for object retrieval and can be easily learned from both visual content and language expression. In particular, diverse and complex attributes (e.g., location, color, category, interaction between object and context) are modeled and incorporated to promote cross-domain alignment for feature learning from multiple perspectives. Based on the sharable attributes, we propose a deep Attribute-Preserving Metric learning (AP-Metric) framework that jointly generates unique query-sensitive region proposals and conducts novel cross-modal feature learning that explicitly pursues consistency over attribute abstraction within both domains for deep metric learning. Benefiting from the cross-domain semantic correlations, our proposed framework can localize challenging visual objects to match complex query expressions within cluttered background accurately. The overall framework is end-to-end trainable. Extensive evaluations on popular datasets including ReferItGame, RefCOCO, and RefCOCO+ demonstrate its superiority. Notably, it achieves state-of-the-art performance on the challenging ReferItGame dataset.
