Jointly Modeling Embedding and Translation to Bridge Video and Language

Authors: Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui

DOI: 10.1109/CVPR.2016.497

Keywords:

Abstract: Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate each word locally given the previous words and the visual content, while the relationship between the semantics of the entire sentence and the visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but their semantics (e.g., subjects, verbs or objects) untrue. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and the visual content, while the latter creates a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and the visual content. Experiments on the YouTube2Text dataset show that the proposed LSTM-E achieves the best published performance to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. Superior performances are also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
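The abstract describes a joint objective with two terms: a relevance loss measuring the distance between video and sentence representations in a shared visual-semantic embedding space, and a coherence loss given by the LSTM's next-word log-likelihood. Below is a minimal PyTorch sketch of such an objective; the class name LSTME, all dimensions, the trade-off weight lam, and the choice to seed the LSTM with the video embedding are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTME(nn.Module):
    """Sketch of an LSTM-E-style joint objective: a relevance loss in a
    shared visual-semantic embedding space plus a coherence (next-word)
    loss from an LSTM decoder. Dimensions and names are illustrative."""

    def __init__(self, vocab_size, video_dim=4096, sent_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video features -> shared space
        self.sent_proj = nn.Linear(sent_dim, embed_dim)    # sentence features -> shared space
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.classifier = nn.Linear(embed_dim, vocab_size)

    def forward(self, video_feat, sent_feat, word_ids):
        # Relevance loss: squared Euclidean distance between the video and
        # sentence embeddings, tying whole-sentence semantics to the video.
        v = self.video_proj(video_feat)            # (B, D)
        s = self.sent_proj(sent_feat)              # (B, D)
        relevance = ((v - s) ** 2).sum(dim=1).mean()

        # Coherence loss: teacher-forced next-word cross-entropy; the video
        # embedding is fed as the first LSTM input so every generated word
        # is conditioned on the visual content and the previous words.
        tokens = self.word_embed(word_ids[:, :-1])           # (B, T-1, D)
        inputs = torch.cat([v.unsqueeze(1), tokens], dim=1)  # (B, T, D)
        hidden, _ = self.lstm(inputs)
        logits = self.classifier(hidden)                     # step t predicts word t
        coherence = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), word_ids.reshape(-1))
        return relevance, coherence

# Joint objective as the abstract describes it, with a hypothetical
# trade-off weight lam between the two terms.
model = LSTME(vocab_size=10000)
video = torch.randn(2, 4096)              # e.g. mean-pooled CNN features
sent = torch.randn(2, 300)                # e.g. pooled word vectors
words = torch.randint(0, 10000, (2, 12))  # caption token ids
rel, coh = model(video, sent, words)
lam = 0.7
loss = (1 - lam) * rel + lam * coh
loss.backward()
```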
