Learning to Extract Local Events from the Web

Authors: John Foley, Michael Bendersky, Vanja Josifovski

DOI: 10.1145/2766462.2767739

Abstract: The goal of this work is extraction and retrieval of local events from web pages. Examples of local events include small venue concerts, theater performances, garage sales, movie screenings, etc. We collect these events in the form of retrievable calendar entries that include structured information about event name, date, time and location. Between existing information extraction techniques and the availability of information on social media and semantic web technologies, there are numerous ways to collect commercial, high-profile events. However, most extraction techniques require domain-level supervision, which is not attainable at web scale. Similarly, while the adoption of the semantic web has grown, there will always be organizations without the resources or the expertise to add machine-readable annotations to their pages. Therefore, our approach bootstraps these explicit annotations to massively scale up local event extraction. We propose a novel event extraction model that uses distant supervision to assign scores to individual event fields (event name, date, time and location) and a structural algorithm to optimally group these fields into event records. Our model integrates information from both the entire source document and its relevant sub-regions, and is highly scalable. We evaluate our extraction model on all 700 million documents in a large publicly available web corpus, ClueWeb12. Using the 217,000 unique explicitly annotated events as distant supervision, we are able to double recall with 85% precision and quadruple it with 65% precision, with no additional human supervision. We also show that our model can be bootstrapped for a fully supervised approach, which can further improve the precision by 30%. In addition, we evaluate the geographic coverage of the extracted events. We find that there is a significant increase in the geo-diversity of extracted events compared to existing explicit annotations, while maintaining high precision levels.
