Authors: Junghoo Cho, Uri Schonfeld
DOI:
Keywords:
Abstract: Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the infinite number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most” of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the “important” part of the Web it will download after crawling a certain number of pages and (2) give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally based on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high “coverage” of the Web with a relatively small number of pages.