Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Authors: Kate Saenko, Subhashini Venugopalan, Jeff Donahue, Raymond Mooney, Marcus Rohrbach

Keywords: Object (computer science), Artificial neural network, Computer science, Recurrent neural network, Artificial intelligence, Natural language, Natural language processing, Vocabulary, Deep learning, Sentence, Symbol grounding, Machine learning

Abstract: Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep …
