Methods for domain-independent information extraction from the web: an experimental comparison

Authors: Michael Cafarella, Oren Etzioni, Daniel S. Weld, Tal Shaked, Stephen Soderland

Keywords: Computer science, World Wide Web, Recall, Information extraction, Domain (software engineering), Process (engineering), Class (computer programming)

Abstract: Our KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an autonomous, domain-independent, and scalable manner. In its first major run, KNOWITALL extracted over 50,000 facts with high precision, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Rule Learning learns domain-specific extraction rules. Subclass Extraction automatically identifies sub-classes in order to boost recall. List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts the elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, no hand-labeled training examples are required. Experiments show the relative coverage of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 19-fold increase in recall, while maintaining high precision, and discovered 10,300 cities missing from the Tipster Gazetteer.

References (26)
Ratanachai Sombatsrisomboon, Mitsuru Ishizuka, Yutaka Matsuo. Acquisition of Hypernyms and Hyponyms from the WWW (2003).
Alexander Maedche, Steffen Staab. Learning ontologies for the semantic web. International Semantic Web Conference, pp. 51-60 (2001).
Oren Etzioni. Moving up the information food chain: deploying softbots on the world wide web. National Conference on Artificial Intelligence, pp. 1322-1326 (1996).
Ellen Riloff, Rosie Jones. Learning dictionaries for information extraction by multi-level bootstrapping. National Conference on Artificial Intelligence, pp. 474-479 (1999).
Sergey Brin. Extracting Patterns and Relations from the World Wide Web. Lecture Notes in Computer Science, pp. 172-183 (1999). DOI: 10.1007/10704656_11
Nicholas Kushmerick, Daniel S. Weld. Wrapper induction for information extraction. International Joint Conference on Artificial Intelligence, pp. 729-737 (1997).
Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. European Conference on Machine Learning, pp. 491-502 (2001). DOI: 10.1007/3-540-44795-4_42
Andrew Kachites McCallum, Dayne Freitag. Information Extraction with HMMs and Shrinkage (1999).
William Phillips, Ellen Riloff. Exploiting Strong Syntactic Heuristics and Co-Training to Learn Semantic Lexicons. Empirical Methods in Natural Language Processing, pp. 125-132 (2002). DOI: 10.3115/1118693.1118710
Robert B. Doorenbos, Oren Etzioni, Daniel S. Weld. A scalable comparison-shopping agent for the World-Wide Web. Adaptive Agents and Multi-Agents Systems, pp. 39-48 (1997). DOI: 10.1145/267658.267666