Web-Prospector - An Automatic, Site-Wide Wrapper Induction Approach for Scientific Deep-Web Databases.

Authors: Steffen Staab, Saqib Mir, Isabel Rojas

DOI:

Keywords:

Abstract: Wrapper induction techniques traditionally focus on learning wrappers based on examples from one class of Web pages, i.e., from Web pages that are all similar in structure and content. Thereby, traditional wrapper induction targets the understanding of Web pages generated from a database using the same generation template as observed in the example set. Applying such wrapper induction techniques to Web sites of biological databases, however, we found that there is a need for wrapping structurally diverse Web pages from multiple classes, making the problem more challenging. Furthermore, such scientific Web sites do not just provide mere data, but they also tend to provide schema information in terms of data labels, giving further cues for solving the Web site wrapping task. In this paper we present a novel approach to automatic information extraction from whole Web sites that considers this challenge and takes advantage of the additional clues commonly available in scientific deep Web databases. The solution consists of a sequence of steps: 1. classification of similar Web pages into classes, 2. discovery of these classes, and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised retrieval across an entire Web site. We test our algorithm against three real-world biochemical deep Web sources and report our preliminary results, which are very promising.
