Site-Wide Wrapper Induction for Life Science Deep Web Databases

Authors: Saqib Mir, Steffen Staab, Isabel Rojas

DOI: 10.1007/978-3-642-02879-3_9

Abstract: We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques focus on learning wrappers based on examples from one class of Web pages, i.e. from Web pages that are all similar in structure and content. Thereby, traditional wrapper induction targets the understanding of Web pages generated from a database using the same generation template as observed in the example set. However, Life Science Web sites typically contain structurally diverse Web pages from multiple classes, making the problem more challenging. Furthermore, we observed that such Life Science Web sites do not just provide mere data, but they also tend to provide schema information in terms of data labels, giving further cues for solving the Web site wrapping task. Our solution to this novel challenge of Site-Wide wrapper induction consists of a sequence of three steps: 1. classification of similar Web pages into classes, 2. discovery of these classes, and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval across an entire Web site. We test our algorithm against three real-world biochemical deep Web sources and report our preliminary results, which are very promising.
