Cross-media Structured Common Space for Multimedia Event Extraction

Authors: Shih-Fu Chang, Di Lu, Manling Li, Heng Ji, Alireza Zareian

DOI:

Keywords: Modalities, Argument (linguistics), Event (computing), Benchmark (computing), Space (commercial competition), Computer science, Task (project management), Annotation, Embedding, Multimedia

Abstract: We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities by employing a weakly supervised training strategy, which enables exploiting available resources without explicit cross-media annotation. Compared to uni-modal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction. Compared to unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
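The core idea of a common embedding space, as described in the abstract, is that modality-specific features (from a text encoder and an image encoder) are projected into one shared space where an alignment score can be computed between a caption and an image from the same document, without span-level cross-media annotation. A minimal sketch of that projection-and-scoring step, using toy hand-set weights (`W_text`, `W_image` are hypothetical; WASE itself learns projections over graph-structured representations, which this sketch omits):

```python
from math import sqrt

def project(vec, weights):
    """Linearly project a modality-specific feature vector into the shared space."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def cosine(a, b):
    """Alignment score between two embeddings in the common space."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy modality-specific features (stand-ins for text-encoder and CNN outputs).
text_feat = [1.0, 0.0, 2.0]
image_feat = [0.5, 1.5]

# Hypothetical learned projections into a 2-d common space.
W_text = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
W_image = [[1.0, 0.0], [0.0, 1.0]]

t = project(text_feat, W_text)    # [1.0, 2.0]
v = project(image_feat, W_image)  # [0.5, 1.5]

# Weak supervision: an image paired with its own document's caption should
# score higher than a mismatched pair; no explicit cross-media labels needed.
print(round(cosine(t, v), 3))
```

In training, such scores would feed a ranking or contrastive loss over matched versus mismatched caption-image pairs; at inference, events and arguments extracted in one modality can be grounded in the other via nearest neighbors in the shared space.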
