Authors: I-Hong Jhuo, Stavros Tsakalidis, Shiv N. Vitaladevuni, Vasant Manohar, Pradeep Natarajan
DOI:
Keywords:
Abstract: We describe the Raytheon BBN (BBN) VISER system, which is designed to detect events of interest in multimedia data. We also present a comprehensive analysis of the different modules of the system in the context of the MED 2011 task. The system incorporates a large set of low-level features that capture appearance, color, motion, audio, and audio-visual co-occurrence patterns in videos. For the low-level features, we rigorously analyzed several coding and pooling strategies, and used state-of-the-art spatio-temporal pooling strategies to model relationships between features. The system also uses high-level (i.e., semantic) visual information obtained from detecting scene, object, and action concepts. Furthermore, the system exploits multimodal information by analyzing available spoken and videotext content using BBN's Byblos automatic speech recognition (ASR) and video text recognition systems. These diverse streams of information are combined into a single, fixed-dimensional vector for each video. We explored two combination strategies: early fusion and late fusion. Early fusion was implemented through a fast kernel-based fusion framework, and late fusion was performed using both Bayesian model combination (BAYCOM) as well as an innovative weighted-average framework. Consistent with the previous MED’10 evaluation, low-level visual features exhibit strong performance and form the basis of our system. However, high-level information from speech, video-text, and object detection provides consistent and significant improvements. Overall, BBN’s VISER system exhibited the best performance among all submitted systems, with an average ANDC score of 0.46 across the 10 MED’11 test events when the threshold was optimized for the NDC score, and a missed detection rate below 30% when the threshold was optimized to minimize missed detections at a 6% false alarm rate.