Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

Authors: Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, Yong Xu

DOI:

Keywords:

Abstract: Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges in this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in previous works, resulting in the same captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to dynamically balance the contributions from the current event and its surrounding contexts. We empirically show that our attentively fused event representation is superior to the proposal hidden states or video contents alone. By coupling the proposal and captioning modules into one unified framework, our model outperforms the state-of-the-art methods on the ActivityNet Captions dataset with a relative gain of over 100% (Meteor score increases from 4.82 to 9.65).
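The abstract names two mechanisms: an attentive fusion of proposal hidden states with clip-level video content, and a context gate that balances the current event against its surrounding context. The PyTorch sketch below shows one plausible shape for such a module. The feature dimensions, the additive attention form, the use of h_fwd + h_bwd as the context summary, and all class and variable names are illustrative assumptions, not the authors' released implementation.

```python
# A minimal, illustrative sketch of attentive fusion plus context gating,
# under the assumptions stated above (not the paper's exact formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveFusionWithContextGate(nn.Module):
    def __init__(self, clip_dim=500, hidden_dim=512):
        super().__init__()
        # Additive attention: score each clip feature against the proposal state.
        self.att_clip = nn.Linear(clip_dim, hidden_dim, bias=False)
        self.att_state = nn.Linear(2 * hidden_dim, hidden_dim, bias=False)
        self.att_score = nn.Linear(hidden_dim, 1, bias=False)
        # Project attended clip content into the hidden space before gating.
        self.proj_content = nn.Linear(clip_dim, hidden_dim)
        # Context gate: sigmoid over the concatenated event and context features.
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, clip_feats, h_fwd, h_bwd):
        """
        clip_feats: (B, T, clip_dim)  per-clip features inside the event
                    (e.g., C3D descriptors)
        h_fwd:      (B, hidden_dim)   forward proposal state (past context)
        h_bwd:      (B, hidden_dim)   backward proposal state (future context)
        """
        context = torch.cat([h_fwd, h_bwd], dim=-1)               # (B, 2H)
        # Attention over clips, conditioned on the bidirectional context.
        scores = self.att_score(torch.tanh(
            self.att_clip(clip_feats) + self.att_state(context).unsqueeze(1)
        ))                                                         # (B, T, 1)
        alpha = F.softmax(scores, dim=1)
        event = self.proj_content((alpha * clip_feats).sum(dim=1))  # (B, H)
        # Context gating: decide, per dimension, how much surrounding
        # context should flow into the fused event representation.
        ctx = h_fwd + h_bwd                                        # (B, H)
        g = torch.sigmoid(self.gate(torch.cat([event, ctx], dim=-1)))
        return g * event + (1.0 - g) * ctx                         # (B, H)


# Toy usage with random tensors standing in for real C3D features.
if __name__ == "__main__":
    fuser = AttentiveFusionWithContextGate()
    clips = torch.randn(2, 8, 500)            # 2 proposals, 8 clips each
    h_f, h_b = torch.randn(2, 512), torch.randn(2, 512)
    print(fuser(clips, h_f, h_b).shape)       # torch.Size([2, 512])
```

One motivation for a per-dimension gate rather than a single scalar weight is that it can retain context in the dimensions that help distinguish events ending at nearly the same time while suppressing it elsewhere.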
