Authors: Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui
Keywords:
Abstract: Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate a word locally given the previous words and the visual content, while the relationship between the semantics of the entire sentence and the visual content is not holistically exploited. As a result, the generated sentences may be contextually correct, but their semantics (e.g., subjects, verbs or objects) are not true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and visual content, while the latter creates a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and the visual content. Experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the to-date best published performance in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. Superior performances are also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
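To make the joint objective described in the abstract concrete, below is a minimal PyTorch sketch that combines a word-level LSTM generation (coherence) loss with a sentence-level visual-semantic embedding (relevance) loss. The module structure, feature dimensions, the mean-squared relevance term, and the weighting factor `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of an LSTM-E-style joint loss; names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSTMEJointLoss(nn.Module):
    """Combines (1) a word-level LSTM generation loss (coherence) with
    (2) a sentence-level visual-semantic embedding loss (relevance)."""

    def __init__(self, video_dim=2048, sent_dim=300, embed_dim=512,
                 vocab_size=10000, alpha=0.5):
        super().__init__()
        self.vid_proj = nn.Linear(video_dim, embed_dim)    # map video feature into embedding space
        self.sent_proj = nn.Linear(sent_dim, embed_dim)    # map sentence feature into embedding space
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim * 2, embed_dim, batch_first=True)
        self.classifier = nn.Linear(embed_dim, vocab_size)
        self.alpha = alpha                                 # assumed trade-off between the two losses

    def forward(self, video_feat, sent_feat, word_ids):
        # Coherence: locally predict each next word from previous words + video content.
        v = self.vid_proj(video_feat)                      # (B, E)
        w = self.word_embed(word_ids[:, :-1])              # (B, T-1, E) previous words
        v_rep = v.unsqueeze(1).expand(-1, w.size(1), -1)   # broadcast video to every time step
        h, _ = self.lstm(torch.cat([w, v_rep], dim=-1))    # (B, T-1, E)
        logits = self.classifier(h)                        # (B, T-1, vocab)
        gen_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   word_ids[:, 1:].reshape(-1))

        # Relevance: pull the whole-sentence embedding toward the video embedding.
        s = self.sent_proj(sent_feat)                      # (B, E)
        rel_loss = F.mse_loss(v, s)

        return self.alpha * gen_loss + (1 - self.alpha) * rel_loss
```

A usage call would pass a batch of pooled video features, sentence-level features, and token ids, e.g. `LSTMEJointLoss()(video_feat, sent_feat, word_ids)`, and backpropagate the returned scalar; the key design point is that both terms share the video embedding, so the generated words and the sentence-level semantics are trained against the same visual representation.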