Authors: Ioannis Kompatsiaris, Ioannis Patras, Symeon Papadopoulos, Giorgos Kordopatis-Zilos
DOI:
Keywords:
Abstract: In this paper we introduce ViSiL, a Video Similarity Learning architecture that considers fine-grained spatio-temporal relations between pairs of videos -- relations that are typically lost in previous video retrieval approaches, which embed the whole frame or even the whole video into a vector descriptor before similarity estimation. By contrast, our Convolutional Neural Network (CNN)-based approach is trained to calculate video-to-video similarity from refined frame-to-frame similarity matrices, so as to consider both intra- and inter-frame relations. In the proposed method, pairwise frame similarity is estimated by applying Tensor Dot (TD) followed by Chamfer Similarity (CS) on regional CNN frame features -- this avoids feature aggregation before the similarity calculation between frames. Subsequently, the similarity matrix between all video frames is fed to a four-layer CNN and then summarized into a video-to-video similarity score that captures the temporal similarity patterns between matching frame sequences. We train the network with a triplet loss scheme and evaluate it on five public benchmark datasets covering four different video retrieval problems, where we demonstrate large improvements in comparison to the state of the art. The implementation of ViSiL is publicly available.
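The frame-level similarity described in the abstract (Tensor Dot followed by Chamfer Similarity over regional CNN features) can be sketched as below. This is a minimal illustrative version, not the authors' implementation: the feature shapes, function names, and random inputs are assumptions for demonstration only.

```python
import numpy as np

def tensor_dot(a, b):
    """Region-to-region similarity between two frames.
    a: (N, D) L2-normalized regional features of frame A.
    b: (M, D) L2-normalized regional features of frame B.
    Returns an (N, M) matrix of cosine similarities."""
    return a @ b.T

def chamfer_similarity(sim):
    """Chamfer Similarity: match each region of A to its best region
    in B (row-wise max), then average over A's regions. This avoids
    aggregating regional features into one vector before comparison."""
    return sim.max(axis=1).mean()

# Hypothetical example: random unit-norm regional features for two frames.
rng = np.random.default_rng(0)
a = rng.normal(size=(9, 4))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(9, 4))
b /= np.linalg.norm(b, axis=1, keepdims=True)

frame_sim = chamfer_similarity(tensor_dot(a, b))
# Cosine similarities of unit vectors lie in [-1, 1], so the
# Chamfer aggregate does too; identical frames score exactly 1.
assert -1.0 <= frame_sim <= 1.0
```

In the paper's pipeline, this frame-to-frame score is computed for every frame pair of the two videos, and the resulting matrix is then processed by the four-layer CNN mentioned above to produce the final video-to-video score.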