ViSiL: Fine-grained Spatio-Temporal Video Similarity Learning

Authors: Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, Ioannis Kompatsiaris

DOI:

Keywords:

Abstract: In this paper we introduce ViSiL, a Video Similarity Learning architecture that considers fine-grained Spatio-Temporal relations between pairs of videos -- relations that are typically lost in previous video retrieval approaches, which embed the whole frame or even the whole video into a vector descriptor before similarity estimation. By contrast, our Convolutional Neural Network (CNN)-based approach is trained to calculate video-to-video similarity from refined frame-to-frame similarity matrices, so as to consider both intra- and inter-frame relations. In the proposed method, pairwise frame similarity is estimated by applying Tensor Dot (TD) followed by Chamfer Similarity (CS) on regional CNN frame features, avoiding feature aggregation before the similarity calculation between frames. Subsequently, the similarity matrix between all video frames is fed to a four-layer CNN and then summarized using Chamfer Similarity into a video-to-video similarity score; this avoids feature aggregation before the similarity calculation between videos and captures the temporal similarity patterns between matching frame sequences. We train the network with a triplet loss scheme and evaluate it on five public benchmark datasets on four different video retrieval problems, where we demonstrate large improvements in comparison to the state of the art. The implementation of ViSiL is publicly available.
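The frame-to-frame similarity described in the abstract (Tensor Dot over regional CNN descriptors followed by Chamfer Similarity) can be sketched in a few lines. The snippet below is a minimal NumPy illustration, assuming L2-normalised regional features of shape (frames, regions, dims); the function names, shapes, and the hard-coded video-level reduction (which stands in for the paper's learned four-layer CNN) are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of TD + Chamfer Similarity on regional frame features.
# Shapes and the final reduction are illustrative assumptions only.
import numpy as np

def frame_to_frame_similarity(query_feats, target_feats):
    """Frame-to-frame Chamfer Similarity between two videos.

    query_feats:  (Tq, R, D) array -- Tq frames, R L2-normalised regional
                  descriptors of dimension D per frame.
    target_feats: (Tt, R, D) array.
    Returns a (Tq, Tt) frame-to-frame similarity matrix.
    """
    # Tensor Dot: all pairwise region-to-region dot products -> (Tq, R, Tt, R)
    region_sims = np.tensordot(query_feats, target_feats, axes=([2], [2]))
    # Chamfer Similarity: max over target regions, mean over query regions.
    return region_sims.max(axis=3).mean(axis=1)

def video_to_video_similarity(sim_matrix):
    """Hypothetical stand-in for the paper's learned four-layer CNN:
    apply Chamfer Similarity directly on the frame-to-frame matrix."""
    return sim_matrix.max(axis=1).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(30, 9, 512)).astype(np.float32)
    t = rng.normal(size=(45, 9, 512)).astype(np.float32)
    # L2-normalise regions so dot products are cosine similarities.
    q /= np.linalg.norm(q, axis=2, keepdims=True)
    t /= np.linalg.norm(t, axis=2, keepdims=True)
    ftf = frame_to_frame_similarity(q, t)   # shape (30, 45)
    print(ftf.shape, video_to_video_similarity(ftf))
```

Note that avoiding aggregation of the regional descriptors before the dot products is the point of the TD step: region-level matches survive until the Chamfer reduction, rather than being averaged away into a single frame or video vector.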
