Authors: John Foley, Michael Bendersky, Vanja Josifovski
Keywords:
Abstract: The goal of this work is the extraction and retrieval of local events from web pages. Examples of local events include small venue concerts, theater performances, garage sales, movie screenings, etc. We collect these events in the form of retrievable calendar entries that include structured information about event name, date, time, and location. Between existing information extraction techniques and the availability of information on social media and semantic web technologies, there are numerous ways to collect commercial, high-profile events. However, most extraction techniques require domain-level supervision, which is not attainable at web scale. Similarly, while the adoption of the semantic web has grown, there will always be organizations without the resources or the expertise to add machine-readable annotations to their pages. Therefore, our approach bootstraps these explicit annotations to massively scale up local event extraction. We propose a novel event extraction model that uses distant supervision to assign scores to individual event fields (event name, date, time, and location) and a structural algorithm to optimally group these fields into event records. Our model integrates information from both the entire source document and its relevant sub-regions, and is highly scalable. We evaluate our extraction model on all 700 million documents in a large publicly available web corpus, ClueWeb12. Using the 217,000 unique explicitly annotated events as distant supervision, we are able to double recall with 85% precision and quadruple it with 65% precision, with no additional human supervision. We also show that our model can be bootstrapped for a fully supervised approach, which can further improve the precision by 30%. In addition, we evaluate the geographic coverage of the extracted events. We find a significant increase in the geo-diversity of extracted events compared to existing explicit annotations, while maintaining high precision levels.