Authors: Michele Merler, Bert Huang, Lexing Xie, Gang Hua, Apostol Natsev
Keywords:
Abstract: We propose semantic model vectors, an intermediate-level semantic representation, as a basis for modeling and detecting complex events in unconstrained real-world videos, such as those from YouTube. The semantic model vectors are extracted using a set of discriminative semantic classifiers, each being an ensemble of SVM models trained from thousands of labeled web images, for a total of 280 generic concepts. Our study reveals that the proposed semantic model vectors representation outperforms, and is complementary to, other low-level visual descriptors for video event modeling. We hence present an end-to-end video event detection system, which combines semantic model vectors with other static or dynamic visual descriptors, extracted at the frame, segment, or full clip level. We perform a comprehensive empirical study on the 2010 TRECVID Multimedia Event Detection task (http://www.nist.gov/itl/iad/mig/med10.cfm), which validates the semantic model vectors representation not only as the best individual descriptor, outperforming state-of-the-art global and local static features as well as spatio-temporal HOG and HOF descriptors, but also as the most compact. We also study early and late feature fusion across the various approaches, leading to a 15% performance boost and an overall system performance of 0.46 mean average precision. In order to promote further research in this direction, we made our semantic model vectors for the TRECVID MED 2010 set publicly available for the community to use (http://www1.cs.columbia.edu/~mmerler/SMV.html).
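The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: each concept classifier is modeled as an average of linear scorers (standing in for the ensemble of SVM models), frame-level vectors are pooled to the clip level by averaging, early fusion is plain concatenation, and late fusion is an unweighted mean of detector scores. All function names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# A concept classifier maps a frame's low-level features to one confidence score.
ConceptClassifier = Callable[[Sequence[float]], float]


def make_ensemble(weight_sets: Sequence[Sequence[float]]) -> ConceptClassifier:
    """Hypothetical ensemble: average the scores of several linear models.

    Assumption: linear scorers stand in for the SVM models in the abstract.
    """
    def score(features: Sequence[float]) -> float:
        scores = [sum(w * x for w, x in zip(ws, features)) for ws in weight_sets]
        return sum(scores) / len(scores)
    return score


def frame_model_vector(features: Sequence[float],
                       classifiers: Sequence[ConceptClassifier]) -> Vector:
    """One score per concept; with 280 concepts this would be a 280-d vector."""
    return [clf(features) for clf in classifiers]


def clip_model_vector(frames: Sequence[Sequence[float]],
                      classifiers: Sequence[ConceptClassifier]) -> Vector:
    """Pool frame-level vectors to the clip level (assumption: mean pooling)."""
    per_frame = [frame_model_vector(f, classifiers) for f in frames]
    return [sum(column) / len(per_frame) for column in zip(*per_frame)]


def early_fusion(descriptors: Sequence[Sequence[float]]) -> Vector:
    """Early fusion sketch: concatenate descriptors before training a detector."""
    return [x for d in descriptors for x in d]


def late_fusion(detector_scores: Sequence[float]) -> float:
    """Late fusion sketch: unweighted mean of per-descriptor detector scores."""
    return sum(detector_scores) / len(detector_scores)


# Toy usage: two concepts, two frames with 2-d features.
classifiers = [make_ensemble([[1.0, 0.0], [1.0, 0.0]]), make_ensemble([[0.0, 1.0]])]
frames = [[1.0, 2.0], [3.0, 4.0]]
clip_vector = clip_model_vector(frames, classifiers)
```

In this toy setup the clip-level vector has one entry per concept, which is why the representation stays compact: its dimensionality depends on the number of concepts rather than on the size of the underlying low-level features.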