Authors: Juliana Freire, Luciano Barbosa
DOI:
Keywords:
Abstract: Recently, there has been increased interest in the retrieval and integration of hidden-Web data, with a view to leveraging the high-quality information available in online databases. Although previous works have addressed many aspects of the actual integration, including matching form schemata and automatically filling out forms, the problem of locating relevant data sources has been largely overlooked. Given the dynamic nature of the Web, where data sources are constantly changing, it is crucial to discover these resources automatically. However, considering the number of documents on the Web (Google already indexes over 8 billion documents), finding the tens, hundreds, or even thousands of forms that are relevant to the integration task is really like looking for a few needles in a haystack. Besides, since the vocabulary and structure of the forms for a given domain are unknown until the forms are actually found, it is hard to define exactly what to look for. We propose a new crawling strategy to automatically locate hidden-Web databases, which aims to achieve a balance between the two conflicting requirements of this problem: the need to perform a broad search while at the same time avoiding the need to crawl a large number of irrelevant pages. The proposed strategy does this by focusing the crawl on a given topic; by judiciously choosing links to follow within a topic that are more likely to lead to pages that contain forms; and by employing appropriate stopping criteria. We describe the algorithms underlying this strategy and an experimental evaluation which shows that our approach is both effective and efficient, leading to larger numbers of forms retrieved as a function of the number of pages visited than other crawlers.
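
The three ingredients named in the abstract (topic focus, link prioritization, and a stopping criterion) can be illustrated with a minimal sketch. This is not the authors' algorithm: the in-memory `web` dictionary, the keyword-based `is_on_topic` test, the URL-based `link_score`, and the global `max_barren_streak` cutoff are all hypothetical stand-ins chosen for illustration, and no real network access is performed.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple

# Toy in-memory "web" (hypothetical): url -> (page text, outgoing links, has_form)
Page = Tuple[str, List[str], bool]


def focused_form_crawl(
    web: Dict[str, Page],
    seeds: List[str],
    is_on_topic: Callable[[str], bool],
    link_score: Callable[[str], float],
    max_pages: int = 100,
    max_barren_streak: int = 5,
) -> Tuple[List[str], int]:
    """Return (URLs of on-topic pages that contain a form, pages visited)."""
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier: List[Tuple[float, str]] = [(-link_score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen: Set[str] = set(seeds)
    forms: List[str] = []
    visited = 0
    barren_streak = 0  # pages visited since the last form was found

    # Assumed stopping criteria: page budget, or too many pages without a form.
    while frontier and visited < max_pages and barren_streak < max_barren_streak:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # dead link in the toy graph
        text, links, has_form = web[url]
        visited += 1
        if not is_on_topic(text):
            barren_streak += 1
            continue  # topic focus: off-topic pages are not expanded
        if has_form:
            forms.append(url)
            barren_streak = 0
        else:
            barren_streak += 1
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-link_score(link), link))
    return forms, visited


# Hypothetical example data: one on-topic site and one off-topic site.
web = {
    "a.example/": ("used cars for sale",
                   ["a.example/search", "a.example/about", "b.example/"], False),
    "a.example/search": ("search used cars", [], True),
    "a.example/about": ("about our car dealership", [], False),
    "b.example/": ("cooking recipes", ["b.example/search"], False),
    "b.example/search": ("search recipes", [], True),
}
forms, visited = focused_form_crawl(
    web,
    ["a.example/"],
    is_on_topic=lambda text: "car" in text,
    link_score=lambda url: 1.0 if "search" in url else 0.0,
)
```

In this toy run the crawler reaches the on-topic search form early because its link scores highest, and it never fetches the form on the off-topic site because the page linking to it is not expanded. A real system would presumably replace the keyword test and the URL heuristic with learned classifiers.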