Authors: Xiangyang Xue, Bingchen Zhao, Xin Wen, Jie Shao
DOI:
Keywords:
Abstract: The current research focus in Content-Based Video Retrieval requires higher-level video representations that describe the long-range semantic dependencies of relevant incidents, events, etc. However, existing methods commonly process the frames of a video as individual images or short clips, making the modeling of long-range semantic dependencies difficult. In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates long-range temporal information between frame-level features using the self-attention mechanism. To train it on video retrieval datasets, we propose a supervised contrastive learning method that performs automatic hard negative mining and utilizes the memory bank mechanism to increase the capacity of negative samples. Extensive experiments are conducted on multiple video retrieval tasks, such as CC_WEB_VIDEO, FIVR-200K, and EVVE. The proposed method shows a significant performance advantage (~17% mAP on FIVR-200K) over state-of-the-art methods with video-level features, and delivers competitive results with 22x faster inference time compared with frame-level features.
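The sketch below illustrates the two ideas named in the abstract: aggregating frame-level features into a video-level embedding with self-attention, and a contrastive loss whose negatives are drawn from a memory bank. It is not the authors' implementation; the 512-d feature size, the single attention layer with mean pooling, and the simplified InfoNCE loss (which does not reproduce the paper's automatic hard negative mining) are all illustrative assumptions.

```python
# Minimal illustrative sketch (NOT the authors' code): self-attention over
# frame features + InfoNCE-style contrastive loss with a memory bank.
# All dimensions and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalContextAggregator(nn.Module):
    """Fuse per-frame features into one video-level embedding via self-attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) frame-level features
        ctx, _ = self.attn(frames, frames, frames)  # long-range temporal context
        ctx = self.norm(frames + ctx)               # residual connection + norm
        video = ctx.mean(dim=1)                     # pool to a video-level vector
        return F.normalize(video, dim=-1)           # unit-norm for retrieval


def contrastive_loss_with_bank(anchor, positive, bank, temperature=0.07):
    """InfoNCE-style loss: one positive per anchor, negatives taken from a
    memory bank of past embeddings (enlarging the pool of negative samples)."""
    # anchor, positive: (batch, dim); bank: (bank_size, dim); all L2-normalized
    pos = (anchor * positive).sum(-1, keepdim=True)      # (batch, 1)
    neg = anchor @ bank.t()                              # (batch, bank_size)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    agg = TemporalContextAggregator()
    frames_a = torch.randn(4, 32, 512)   # 4 videos, 32 frame features each
    frames_b = torch.randn(4, 32, 512)   # corresponding positive views
    bank = F.normalize(torch.randn(4096, 512), dim=-1)  # memory bank of negatives
    loss = contrastive_loss_with_bank(agg(frames_a), agg(frames_b), bank)
    print(loss.item())
```

Pooling the attention output into a single vector is what enables the video-level retrieval speedup the abstract reports: similarity search reduces to one dot product per video rather than a frame-by-frame comparison.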