Authors: Kate Saenko, Subhashini Venugopalan, Raymond Mooney, Sergio Guadarrama, Jesse Thomason
DOI:
Keywords:
Abstract: This paper integrates techniques in natural language processing and computer vision to improve the recognition and description of entities and activities in real-world videos. We propose a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics. We use state-of-the-art systems to obtain confidences on the entities, activities, and scenes present in the video. Our model combines these detection confidences with probabilistic knowledge mined from text corpora to estimate the most likely subject, verb, object, and place. Results on YouTube videos show that our approach improves both the joint prediction of these latent, diverse sentence components and some individual components when compared to the vision system alone, as well as over a previous n-gram language-modeling approach. The approach allows us to automatically generate more accurate, richer sentential descriptions of a wide array of possible video content.
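The abstract describes combining vision detection confidences with text-mined statistics to pick the most likely subject, verb, object, and place. The sketch below is a minimal illustration of that idea, not the paper's actual factor graph or inference procedure: all confidence values and co-occurrence probabilities are made-up placeholders, the scoring function and weighting parameter `alpha` are assumptions, and MAP inference is done by brute-force enumeration rather than message passing.

```python
# Simplified sketch (not the paper's implementation): pick the most likely
# (subject, verb, object, place) tuple by combining hypothetical vision
# confidences with hypothetical text-mined co-occurrence statistics.
from itertools import product

# Hypothetical vision detection confidences for each sentence component.
vision_conf = {
    "subject": {"person": 0.6, "dog": 0.4},
    "verb":    {"ride": 0.5, "play": 0.5},
    "object":  {"bicycle": 0.5, "ball": 0.5},
    "place":   {"street": 0.7, "park": 0.3},
}

# Hypothetical language statistics mined from text corpora, e.g. normalized
# subject-verb-object and verb-place co-occurrence counts.
svo_prob = {
    ("person", "ride", "bicycle"): 0.30,
    ("dog", "play", "ball"): 0.20,
    ("person", "play", "ball"): 0.15,
}
verb_place_prob = {
    ("ride", "street"): 0.25,
    ("play", "park"): 0.20,
}

def tuple_score(s, v, o, p, alpha=0.5):
    """Combine vision confidences with language statistics.

    alpha (an assumed parameter) weights the vision factors against the
    language factors; unseen combinations receive a small smoothing value.
    """
    vision = (vision_conf["subject"][s] * vision_conf["verb"][v]
              * vision_conf["object"][o] * vision_conf["place"][p])
    language = (svo_prob.get((s, v, o), 1e-3)
                * verb_place_prob.get((v, p), 1e-3))
    return (vision ** alpha) * (language ** (1 - alpha))

# Brute-force MAP estimate over all candidate tuples; a real factor graph
# would use message passing instead of enumeration.
candidates = product(vision_conf["subject"], vision_conf["verb"],
                     vision_conf["object"], vision_conf["place"])
best = max(candidates, key=lambda t: tuple_score(*t))
print("Most likely (subject, verb, object, place):", best)
```

Here the language statistics pull the final tuple toward combinations that are plausible in text, even when the vision confidences alone are ambiguous, which is the intuition behind combining the two knowledge sources.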