Searching for Hidden-Web Databases

Authors: Juliana Freire, Luciano Barbosa

DOI:

Keywords:

Abstract: Recently, there has been increased interest in the retrieval and integration of hidden-Web data with a view to leveraging the high-quality information available in online databases. Although previous works have addressed many aspects of the actual integration, including matching form schemata and automatically filling out forms, the problem of locating relevant data sources has been largely overlooked. Given the dynamic nature of the Web, where data sources are constantly changing, it is crucial to automatically discover these resources. However, considering the number of documents on the Web (Google already indexes over 8 billion documents), finding the tens, hundreds or even thousands of forms that are relevant to the integration task is really like looking for a few needles in a haystack. Besides, since the vocabulary and structure of forms for a given domain are unknown until the forms are actually found, it is hard to define exactly what to look for. We propose a new crawling strategy to automatically locate hidden-Web databases, which aims to achieve a balance between the two conflicting requirements of this problem: the need to perform a broad search while at the same time avoiding the need to crawl a large number of irrelevant pages. The proposed strategy does that by focusing the crawl on a given topic; by judiciously choosing links to follow within a topic that are more likely to lead to pages that contain forms; and by employing appropriate stopping criteria. We describe the algorithms underlying this strategy and an experimental evaluation which shows that our approach is both effective and efficient, leading to larger numbers of forms retrieved as a function of the number of pages visited than other crawlers.
