Authors: Michael Cafarella, Oren Etzioni, Daniel S. Weld, Tal Shaked, Stephen Soderland
DOI:
Keywords: Computer science, World Wide Web, Recall, Information extraction, Domain (software engineering), Process (engineering), Class (computer programming)
Abstract: Our KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an autonomous, domain-independent, and scalable manner. In its first major run, KNOWITALL extracted over 50,000 facts with high precision, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Rule Learning learns domain-specific extraction rules. Subclass Extraction automatically identifies sub-classes in order to boost recall. List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts the elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, no hand-labeled training examples are required. Experiments show the relative coverage of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 19-fold increase in recall, while maintaining high precision, and discovered 10,300 cities missing from the Tipster Gazetteer.