The Evolution of the Web and Implications for an Incremental Crawler

Authors: Hector Garcia-Molina, Junghoo Cho

Abstract: In this paper we study how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode. The incremental crawler can improve the "freshness" of the collection significantly and bring in new pages in a more timely manner. We first present results from an experiment conducted on more than half a million web pages over 4 months, to estimate how web pages evolve over time. Based on these experimental results, we compare various design choices for an incremental crawler and discuss their trade-offs. We propose an architecture for the incremental crawler, which combines the best design choices.

References (13)
Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Symposium on Discrete Algorithms, pp. 668-677 (1998). DOI: 10.5555/314613.315045
Junghoo Cho, Hector Garcia-Molina. Synchronizing a database to improve freshness. International Conference on Management of Data, vol. 29, pp. 117-128 (2000). DOI: 10.1145/335191.335391
E. G. Coffman, Zhen Liu, Richard R. Weber. Optimal Robot Scheduling for Web Search Engines. Journal of Scheduling, vol. 1, pp. 15-29 (1998). DOI: 10.1002/(SICI)1099-1425(199806)1:1<15::AID-JOS3>3.0.CO;2-K
Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient crawling through URL ordering. The Web Conference, vol. 30, pp. 161-172 (1998). DOI: 10.1016/S0169-7552(98)00108-1
Sergey Brin, Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. The Web Conference, vol. 30, pp. 107-117 (1998). DOI: 10.1016/S0169-7552(98)00110-X
Craig E. Wills, Mikhail Mikhailov. Towards a better understanding of Web resources and server responses for improved caching. The Web Conference, vol. 31, pp. 1231-1243 (1999). DOI: 10.1016/S1389-1286(99)00037-7
James Pitkow, Peter Pirolli. Life, death, and lawfulness on the electronic frontier. Human Factors in Computing Systems, pp. 383-390 (1997). DOI: 10.1145/258549.258805
Marshall K. McKusick, William N. Joy, Samuel J. Leffler, Robert S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, vol. 2, pp. 181-197 (1984). DOI: 10.1145/989.990
Michael Frey, H. M. Taylor, S. Karlin. An introduction to stochastic modeling. Journal of the American Statistical Association, vol. 36, pp. 428- (1985). DOI: 10.2307/1269970
Soumen Chakrabarti, Martin van den Berg, Byron Dom. Focused crawling: a new approach to topic-specific Web resource discovery. The Web Conference, vol. 31, pp. 1623-1640 (1999). DOI: 10.1016/S1389-1286(99)00052-3