Authors: Pradipto Das, Chenliang Xu, Richard F. Doell, Jason J. Corso
Keywords: Image stitching, Object (computer science), Topic model, Semantics, Language model, Video tracking, Computer science, Object detection, Natural language, Artificial intelligence, Natural language processing
Abstract: The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have either focused on a top-down approach of generating language through combinations of object detections and language models, or on bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest neighbor techniques. In contrast, describing videos with natural language is a less studied problem. In this paper, we combine ideas from the bottom-up and top-down approaches to image description and propose a method for video description that captures the most relevant contents of a video in a natural language description. We propose a hybrid system consisting of a low-level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors, and a high-level module to produce final lingual descriptions. We compare the results of our system to human descriptions in both short and long forms on two datasets, and demonstrate that the final system output has greater agreement with the human descriptions than any single level.