Authors: Wei Liu, Wenhao Jiang, Yong Xu, Lin Ma, Jingwen Wang
DOI:
Keywords:
Abstract: Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges on this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in previous works, resulting in the same captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to balance the contributions of the current event and its surrounding contexts dynamically. We empirically show that our attentively fused event representation is superior to the proposal hidden states or video contents alone. By coupling proposal and captioning modules into one unified framework, our model outperforms the state-of-the-arts on the ActivityNet Captions dataset with a relative gain of over 100% (Meteor score increases from 4.82 to 9.65).
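To make the attentive-fusion and context-gating ideas in the abstract concrete, below is a minimal PyTorch-style sketch. It is an illustration only: the module name, layer shapes, and the exact gating formula are assumptions for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveFusionWithContextGating(nn.Module):
    """Sketch: attentively fuse proposal hidden states with visual features
    (e.g., C3D), then use a context gate to balance the current event
    against a representation of its surrounding context.
    All dimensions and the gating form are illustrative assumptions."""

    def __init__(self, hidden_dim, visual_dim, fused_dim):
        super().__init__()
        self.attn = nn.Linear(hidden_dim + visual_dim, 1)    # per-step attention score
        self.proj = nn.Linear(hidden_dim + visual_dim, fused_dim)
        self.gate = nn.Linear(2 * fused_dim, fused_dim)       # context gate

    def forward(self, proposal_states, visual_feats, context_repr):
        # proposal_states: (T, hidden_dim) hidden states within the event span
        # visual_feats:    (T, visual_dim) clip-level features (e.g., C3D)
        # context_repr:    (fused_dim,)    summary of the surrounding context
        joint = torch.cat([proposal_states, visual_feats], dim=-1)       # (T, H+V)
        weights = F.softmax(self.attn(joint).squeeze(-1), dim=0)         # (T,)
        fused = self.proj((weights.unsqueeze(-1) * joint).sum(dim=0))    # (fused_dim,)
        g = torch.sigmoid(self.gate(torch.cat([fused, context_repr], dim=-1)))
        # Gated mixture of the current event representation and its context.
        return g * fused + (1.0 - g) * context_repr

# Example usage with made-up dimensions:
fusion = AttentiveFusionWithContextGating(hidden_dim=512, visual_dim=500, fused_dim=512)
out = fusion(torch.randn(20, 512), torch.randn(20, 500), torch.randn(512))
print(out.shape)  # torch.Size([512])
```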