Temporal Context Aggregation for Video Retrieval with Contrastive Learning

Authors: Xiangyang Xue, Bingchen Zhao, Xin Wen, Jie Shao

Abstract: The current research focus on Content-Based Video Retrieval requires higher-level video representations that describe the long-range semantic dependencies of relevant incidents, events, etc. However, existing methods commonly process the frames of a video as individual images or short clips, making the modeling of such long-range dependencies difficult. In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates temporal information between frame-level features using the self-attention mechanism. To train it on video retrieval datasets, we propose a supervised contrastive learning method that performs automatic hard negative mining and utilizes the memory bank mechanism to increase the capacity of negative samples. Extensive experiments are conducted on multiple video retrieval tasks, such as CC_WEB_VIDEO, FIVR-200K, and EVVE. The proposed method shows a significant performance advantage (~17% mAP on FIVR-200K) over state-of-the-art methods with video-level features, and delivers competitive results with 22x faster inference time compared with methods using frame-level features.
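
The two ideas named in the abstract, self-attention over frame-level features and contrastive training with a memory bank of negatives, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the identity query/key/value projections, the mean pooling, the InfoNCE-style loss, the temperature value, and all function and class names (`video_embedding`, `contrastive_loss`, `MemoryBank`) are assumptions chosen for illustration, since the abstract does not give the exact formulation.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def self_attention(frames):
    """Single-head scaled dot-product self-attention over frame features.

    Assumption: identity projections; a trained model would learn
    separate query, key, and value matrices.
    """
    scale = math.sqrt(len(frames[0]))
    out = []
    for q in frames:
        weights = softmax([dot(q, k) / scale for k in frames])
        out.append([sum(w * v[i] for w, v in zip(weights, frames))
                    for i in range(len(q))])
    return out


def video_embedding(frames):
    """Contextualize frames, mean-pool over time, then L2-normalize."""
    ctx = self_attention(frames)
    pooled = [sum(col) / len(ctx) for col in zip(*ctx)]
    return l2_normalize(pooled)


def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss (an assumed stand-in for the paper's loss).

    Negatives that are more similar to the anchor get larger softmax
    weight, so hard negatives dominate the loss without explicit mining.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    return -math.log(softmax(logits)[0])


class MemoryBank:
    """Fixed-size FIFO queue of past embeddings reused as negatives."""

    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)

    def add(self, embeddings):
        self.queue.extend(embeddings)

    def negatives(self):
        return list(self.queue)


# Toy frame-level features: two near-duplicate videos and two unrelated ones.
video_a = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
video_a_dup = [[0.95, 0.05, 0.0], [1.0, 0.0, 0.1]]
video_b = [[0.0, 1.0, 0.0], [0.0, 0.9, 0.2]]
video_c = [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]]

bank = MemoryBank(capacity=4)
bank.add([video_embedding(video_b), video_embedding(video_c)])

anchor = video_embedding(video_a)
loss_match = contrastive_loss(anchor, video_embedding(video_a_dup), bank.negatives())
loss_mismatch = contrastive_loss(anchor, video_embedding(video_b), bank.negatives())
print(loss_match, loss_mismatch)
```

In this toy setup the loss is lower when the positive is the near-duplicate video than when an unrelated video is treated as the positive. The memory bank lets the number of negatives exceed what a single batch would provide, which is the role the abstract assigns to it.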
