Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild

Authors: Kate Saenko, Subhashini Venugopalan, Raymond Mooney, Sergio Guadarrama, Jesse Thomason

DOI:

Keywords:

Abstract: This paper integrates techniques in natural language processing and computer vision to improve recognition and description of entities and activities in real-world videos. We propose a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics. We use state-of-the-art visual recognition systems to obtain confidences on entities, activities, and scenes present in the video. Our model combines these detection confidences with probabilistic knowledge mined from text corpora to estimate the most likely subject, verb, object, and place. Results on YouTube videos show that our approach improves both the joint selection of the latent, diverse sentence components and some individual detections when compared to the vision system alone, as well as over a previous n-gram language-modeling approach. The model allows us to automatically generate more accurate, richer sentential descriptions of videos with a wide array of possible content.
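To make the core idea concrete, the sketch below shows one way visual detection confidences and text-mined co-occurrence statistics could be combined to pick the most likely (subject, verb, object, place) tuple. It is a minimal illustration, not the paper's factor-graph model: the candidate words, confidence values, co-occurrence scores, and the alpha weighting are all hypothetical, and the exhaustive enumeration merely stands in for proper factor-graph inference.

```python
import itertools

# Hypothetical visual detection confidences for each sentence slot
# (names and numbers are illustrative, not taken from the paper).
visual_conf = {
    "subject": {"person": 0.7, "dog": 0.3},
    "verb":    {"ride": 0.4, "walk": 0.5, "chop": 0.1},
    "object":  {"bike": 0.5, "onion": 0.2, "road": 0.3},
    "place":   {"street": 0.6, "kitchen": 0.4},
}

# Hypothetical language statistics mined from text corpora:
# pairwise co-occurrence scores between slot values.
lang_stats = {
    ("person", "ride"): 0.6, ("person", "walk"): 0.7, ("dog", "walk"): 0.8,
    ("ride", "bike"): 0.9, ("walk", "road"): 0.6, ("chop", "onion"): 0.9,
    ("ride", "street"): 0.7, ("walk", "street"): 0.6, ("chop", "kitchen"): 0.9,
}

def score(s, v, o, p, alpha=0.5):
    """Combine visual confidences with language co-occurrence scores.

    alpha weights visual evidence against the text-mined prior;
    unseen co-occurrence pairs fall back to a small smoothing value.
    """
    vis = (visual_conf["subject"][s] * visual_conf["verb"][v]
           * visual_conf["object"][o] * visual_conf["place"][p])
    lang = (lang_stats.get((s, v), 0.05)
            * lang_stats.get((v, o), 0.05)
            * lang_stats.get((v, p), 0.05))
    return (vis ** alpha) * (lang ** (1.0 - alpha))

# Brute-force search over the small candidate set; this stands in for
# factor-graph inference, which scales to much larger vocabularies.
best = max(
    itertools.product(visual_conf["subject"], visual_conf["verb"],
                      visual_conf["object"], visual_conf["place"]),
    key=lambda svop: score(*svop),
)
print("Most likely (subject, verb, object, place):", best)
```

Running this prints a tuple such as ("person", "walk", "road", "street"), which a template-based realizer could then turn into a sentence like "A person walks on the road in the street"; the design choice of multiplying visual and language scores reflects the abstract's claim that corpus statistics can correct weak or ambiguous detections.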
