Authors: Piotr Indyk, Taher H. Haveliwala, Aristides Gionis
DOI:
Keywords:
Abstract: Clustering is one of the most crucial techniques for dealing with the massive amount of information present on the web. Clustering can either be performed once offline, independent of search queries, or performed online on the results of search queries. Our offline approach aims to efficiently cluster similar pages on the web, using the technique of Locality-Sensitive Hashing (LSH), in which web pages are hashed in such a way that similar pages have a much higher probability of collision than dissimilar pages. Our preliminary experiments on the Stanford WebBase have shown that the hash-based scheme can be scaled to millions of URLs.
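
To illustrate the idea of hashing pages so that similar ones collide, here is a minimal Python sketch. It assumes MinHash signatures over word shingles with banding, which is one common LSH family for set similarity; the abstract does not say which hash family or page representation the authors used. The names `shingles`, `minhash_signature`, and `candidate_pairs`, and the parameters `k`, `bands`, and `rows`, are hypothetical and chosen only for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a page's text (assumed representation)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item, parameterized by a seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes):
    """For each seeded hash function, keep the minimum hash value over the set."""
    return tuple(min(_hash(seed, x) for x in items) for seed in range(num_hashes))


def candidate_pairs(pages, bands=10, rows=2):
    """Bucket pages by each band of their signature; pages sharing a bucket collide."""
    buckets = defaultdict(set)
    for url, text in pages.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            # A page lands in one bucket per band, keyed by that slice of the signature.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(url)
    pairs = set()
    for urls in buckets.values():
        pairs.update(combinations(sorted(urls), 2))
    return pairs
```

Under these assumptions, two pages agree on any single signature position with probability equal to the Jaccard similarity of their shingle sets, so near-duplicate pages are very likely to share at least one band while unrelated pages almost never do. Only the colliding pairs then need an exact similarity check, which is what lets a hash-based scheme avoid comparing all pairs of pages.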