SmartCrawl: a new strategy for the exploration of the hidden web

Authors: Augusto de Carvalho Fontes, Fábio Soares Silva

DOI: 10.1145/1031453.1031457

Keywords: Hyperlink, Web search query, Web intelligence, Web page, Information retrieval, World Wide Web, Site map, Web navigation, Computer science, Web search engine, Web crawler

Abstract: The way current search engines work leaves a large amount of the information available in the World Wide Web outside their catalogues. This is due to the fact that crawlers work by following hyperlinks and a few other references, and ignore HTML forms. In this paper, we propose a search engine prototype that can retrieve information behind HTML forms by automatically generating queries for them. We describe the architecture, some implementation details, and an experiment that proves that this information is not indexed by current search engines.
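The idea of generating queries for HTML forms can be illustrated with a minimal sketch. It is not the SmartCrawl implementation, which this record does not describe; the names `FormFinder` and `generate_queries`, the keyword list, and the `example.org` URL are all hypothetical, and the sketch assumes that only GET forms with free-text inputs are queried.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFinder(HTMLParser):
    """Collect each <form> with its action, method, and named text inputs."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
        elif tag == "input" and self._current is not None:
            # Assumption: only named free-text inputs are filled in.
            is_text = (attrs.get("type") or "text").lower() == "text"
            if is_text and attrs.get("name"):
                self._current["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def generate_queries(page_url, html, keywords):
    """Return one GET URL per (form, keyword) pair; POST forms are skipped."""
    finder = FormFinder()
    finder.feed(html)
    urls = []
    for form in finder.forms:
        if form["method"] != "get" or not form["fields"]:
            continue
        target = urljoin(page_url, form["action"])
        for word in keywords:
            # Every text field receives the same keyword in this sketch.
            params = {name: word for name in form["fields"]}
            urls.append(target + "?" + urlencode(params))
    return urls


# Hypothetical page containing a single search form.
page = '<form action="/search" method="get"><input type="text" name="q"></form>'
queries = generate_queries(
    "http://example.org/index.html", page, ["crawler", "hidden web"]
)
```

A crawler that follows only hyperlinks would never request these URLs, because they exist only once a form has been filled in. Choosing which keywords to submit is the hard part and is left open here.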

References (8)
Juliano Palmieri Lage, Altigran S. da Silva, Paulo B. Golgher, Alberto H. F. Laender. Collecting hidden web pages for data extraction. Proceedings of the fourth international workshop on Web information and data management - WIDM '02, pp. 69-75 (2002). DOI: 10.1145/584931.584946
Sergey Brin, Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. The Web Conference, vol. 30, pp. 107-117 (1998). DOI: 10.1016/S0169-7552(98)00110-X
M. K. Bergman. The deep web: Surfacing hidden value. J. Electronic Publishing, the University of Michigan (2001).
King-Ip Lin, Hui Chen. Automatic information discovery from the "invisible Web". International Conference on Information Technology: Coding and Computing, pp. 332-337 (2002). DOI: 10.1109/ITCC.2002.1000411
Stephen W. Liddle, David W. Embley, Del T. Scott, Sai Ho Yau. Extracting Data behind Web Forms. Lecture Notes in Computer Science, pp. 402-413 (2003). DOI: 10.1007/978-3-540-45275-1_35
V. Shkapenyuk, T. Suel. Design and implementation of a high-performance distributed Web crawler. International Conference on Data Engineering, pp. 357-368 (2002). DOI: 10.1109/ICDE.2002.994750
Hector Garcia-Molina, Sriram Raghavan. Crawling the Hidden Web. Very Large Data Bases, pp. 129-138 (2001).