Learning to extract text-based information from the World Wide Web

Author: Stephen Soderland

Abstract: There is a wealth of information to be mined from narrative text on the World Wide Web. Unfortunately, standard natural language processing (NLP) extraction techniques expect full, grammatical sentences, and perform poorly on the choppy sentence fragments that are often found on web pages. This paper introduces Webfoot, a preprocessor that parses web pages into logically coherent segments based on page layout cues. Output from Webfoot is then passed on to CRYSTAL, an NLP system that learns text extraction rules from example. Webfoot and CRYSTAL transform the text into a formal representation that is equivalent to relational database entries. This is a necessary first step for knowledge discovery and other automated analysis of free text.
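The idea of splitting a page into segments using layout cues can be illustrated with a minimal sketch. This is not Webfoot's actual algorithm, which the abstract does not detail; it only assumes that certain HTML tags mark segment boundaries. The tag set `BREAK_TAGS`, the class `LayoutSegmenter`, and the function `segment_page` are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags act as layout cues that separate segments.
# The set is illustrative and is not taken from the paper.
BREAK_TAGS = {"p", "br", "hr", "li", "tr", "td", "div",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class LayoutSegmenter(HTMLParser):
    """Collects text and starts a new segment at every layout tag."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []

    def _flush(self):
        # Join buffered text, normalize whitespace, keep non-empty segments.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BREAK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BREAK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def segment_page(html):
    """Return the list of text segments found in an HTML string."""
    parser = LayoutSegmenter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any text left after the last tag
    return parser.segments


if __name__ == "__main__":
    # Invented sample page, used only to show the call.
    page = "<h2>Forecast</h2><p>Today: sunny,<br>high <b>75</b>.</p>"
    for segment in segment_page(page):
        print(segment)
```

Inline tags such as `<b>` do not break a segment in this sketch, so a fragment stays intact across emphasis markup. In the pipeline the abstract describes, segments like these would then be handed to a rule-learning extraction system such as CRYSTAL.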
