RankMass Crawler: A Crawler with High PageRank Coverage Guarantee.

Authors: Junghoo Cho, Uri Schonfeld

Abstract: Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the infinite number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover "most" of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the "important" part of the Web it will download after crawling a certain number of pages and (2) give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally based on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high "coverage" of the Web with a relatively small number of pages.

References (26)
David J. DeWitt, Yuan Wang. Computing PageRank in a Distributed Internet Search Engine System. Very Large Data Bases, pp. 420-431 (2004).
Martin Ester, Hans-Peter Kriegel, Matthias Schubert. Accurate and efficient crawling for relevant websites. Very Large Data Bases, pp. 396-407 (2004). doi:10.1016/B978-012088469-8.50037-1
Zoltán Gyöngyi, Hector Garcia-Molina, Jan Pedersen. Combating web spam with TrustRank. Very Large Data Bases, pp. 576-587 (2004). doi:10.1016/B978-012088469-8.50052-8
Marc Najork, Janet L. Wiener. Breadth-first crawling yields high-quality pages. Proceedings of the tenth international conference on World Wide Web - WWW '01, pp. 114-118 (2001). doi:10.1145/371920.371965
Amy N. Langville, Carl D. Meyer. Deeper Inside PageRank. Internet Mathematics, vol. 1, pp. 335-380 (2004). doi:10.1080/15427951.2004.10129091
Soumen Chakrabarti, Kunal Punera, Mallela Subramanyam. Accelerated focused crawling through online relevance feedback. The Web Conference, pp. 148-159 (2002). doi:10.1145/511446.511466
Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient crawling through URL ordering. The Web Conference, vol. 30, pp. 161-172 (1998). doi:10.1016/S0169-7552(98)00108-1
J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, L. Ozsen. Optimal crawling strategies for web search engines. The Web Conference, pp. 136-147 (2002). doi:10.1145/511446.511465
Sergey Brin, Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. The Web Conference, vol. 30, pp. 107-117 (1998). doi:10.1016/S0169-7552(98)00110-X
Glen Jeh, Jennifer Widom. Scaling personalized web search. Proceedings of the twelfth international conference on World Wide Web - WWW '03, pp. 271-279 (2003). doi:10.1145/775152.775191