Learning to extract text-based information from the World Wide Web

Author: Stephen Soderland

DOI:

Keywords:

Abstract: There is a wealth of information to be mined from narrative text on the World Wide Web. Unfortunately, standard natural language processing (NLP) extraction techniques expect full, grammatical sentences, and perform poorly on the choppy sentence fragments that are often found on web pages. This paper introduces Webfoot, a preprocessor that parses web pages into logically coherent segments based on page layout cues. Output from Webfoot is then passed on to CRYSTAL, an NLP system that learns rules from examples. Together, Webfoot and CRYSTAL transform the text into a formal representation that is equivalent to relational database entries. This is a necessary first step for knowledge discovery and other automated analysis of free text.
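The segmentation step described in the abstract can be illustrated with a minimal sketch. It is not Webfoot itself: the set of tags treated as layout cues (BREAK_TAGS), the class name LayoutSegmenter, and the function segment_page are all assumptions made for illustration, and the abstract does not say which cues Webfoot actually uses. The sketch only shows the general idea of cutting a page into text segments wherever a layout tag occurs, so that each fragment can be handed to a downstream extractor on its own.

```python
from html.parser import HTMLParser

# Assumed layout cues: tags that usually start or end a visual block.
# This list is hypothetical and is not taken from the paper.
BREAK_TAGS = {
    "p", "br", "hr", "li", "ul", "ol", "tr", "td", "th", "table", "div",
    "h1", "h2", "h3", "h4", "h5", "h6",
}


class LayoutSegmenter(HTMLParser):
    """Collects page text into segments, splitting at layout tags."""

    def __init__(self):
        super().__init__()
        self.segments = []   # finished text segments, in page order
        self._buffer = []    # text pieces of the segment being built

    def flush(self):
        # Join buffered text, collapse runs of whitespace, keep non-empty results.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BREAK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BREAK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Inline tags such as <b> or <a> do not split a segment,
        # so their text is appended to the current buffer.
        self._buffer.append(data)


def segment_page(html):
    """Return the list of text segments found in an HTML string."""
    parser = LayoutSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last tag
    return parser.segments


if __name__ == "__main__":
    # Hypothetical sample page, used only to exercise the sketch.
    sample = (
        "<h2>Forecast</h2>"
        "<p>Today: <b>sunny</b>, high 75.</p>"
        "<ul><li>Tonight: cloudy</li><li>Low 50</li></ul>"
    )
    for segment in segment_page(sample):
        print(segment)
```

Each returned segment is a short, self-contained fragment. In the pipeline the abstract describes, fragments of this kind would then be passed to a rule learner such as CRYSTAL rather than being parsed as full sentences.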

References (9)
Wendy Lehnert, Jonathan Aseltine, David Fisher, Stephen Soderland. CRYSTAL: inducing a conceptual dictionary. International Joint Conference on Artificial Intelligence, pp. 1314-1319 (1995).
Stephen Glenn Soderland. Learning text analysis rules for domain-specific natural language processing. University of Massachusetts (1996).
Nicholas Kushmerick, Daniel S. Weld. Wrapper induction for information extraction. International Joint Conference on Artificial Intelligence, pp. 729-737 (1997).
Ryszard S. Michalski. A theory and methodology of inductive learning. Computer Compacts, vol. 1, pp. 49- (1983). DOI: 10.1016/0167-7136(83)90132-4
Ralph Grishman. The NYU system for MUC-6 or where's the syntax? Proceedings of the 6th Conference on Message Understanding (MUC6 '95), pp. 167-175 (1995). DOI: 10.3115/1072399.1072415
Robert B. Doorenbos, Oren Etzioni, Daniel S. Weld. A scalable comparison-shopping agent for the World-Wide Web. Adaptive Agents and Multi-Agents Systems, pp. 39-48 (1997). DOI: 10.1145/267658.267666
George R. Krupka. SRA. Proceedings of the 6th Conference on Message Understanding (MUC6 '95), pp. 221-235 (1995). DOI: 10.3115/1072399.1072419
Peter Clark, Tim Niblett. The CN2 induction algorithm. Machine Learning, vol. 3, pp. 261-283 (1989). DOI: 10.1023/A:1022641700528
Damaris Ayuso, Sean Boisen, Heidi Fox, Herb Gish, Robert Ingria, Ralph Weischedel. BBN. Proceedings of the 4th Conference on Message Understanding (MUC4 '92), pp. 169-176 (1992). DOI: 10.3115/1072064.1072091