Scalable Techniques for Clustering the Web.

Authors: Piotr Indyk, Taher H. Haveliwala, Aristides Gionis

DOI:

Keywords:

Abstract: Clustering is one of the most crucial techniques for dealing with the massive amount of information present on the web. Clustering can either be performed once offline, independent of search queries, or performed online on the results of search queries. Our offline approach aims to efficiently cluster similar pages on the web, using the technique of Locality-Sensitive Hashing (LSH), in which web pages are hashed in such a way that similar pages have a much higher probability of collision than dissimilar pages. Our preliminary experiments on the Stanford WebBase have shown that the hash-based scheme can be scaled to millions of URLs.
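
The collision property described in the abstract can be illustrated with a minimal sketch. The abstract does not say which hash family or page representation is used, so everything below is an assumption made for illustration: pages are represented as sets of word shingles, the hash family is min-hashing over those sets, and the names `shingles`, `minhash_signature`, `lsh_buckets`, and `candidate_pairs`, along with the `bands` and `rows` parameters, are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a page's text (assumed representation)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes):
    """Min-hash signature: for each seed, the smallest hash value over the set."""
    return tuple(min(_hash(seed, x) for x in items) for seed in range(num_hashes))


def lsh_buckets(pages, bands=10, rows=2):
    """Hash each page into one bucket per band of its signature.

    Pages whose signatures agree on every row of some band share a bucket,
    so similar pages collide far more often than dissimilar ones.
    """
    buckets = defaultdict(set)
    for url, text in pages.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(url)
    return buckets


def candidate_pairs(buckets):
    """Collect every pair of URLs that share at least one bucket."""
    pairs = set()
    for urls in buckets.values():
        ordered = sorted(urls)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Calling `candidate_pairs(lsh_buckets(pages))` on a dictionary that maps URLs to page text yields the pairs worth comparing in a later clustering step. Raising `rows` makes collisions between weakly similar pages rarer, while raising `bands` makes collisions between similar pages more likely; the values shown are placeholders rather than settings taken from the paper.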

References (15)
Hector Garcia-Molina, Narayanan Shivakumar. Detecting digital copyright violations on the internet. (1999)
Piotr Indyk, Aristides Gionis, Rajeev Motwani. Similarity Search in High Dimensions via Hashing. Very Large Data Bases, pp. 518-529 (1999)
Gerard Salton, Michael J. McGill. Introduction to Modern Information Retrieval. (1983)
Piotr Indyk. A small approximately min-wise independent family of hash functions. Symposium on Discrete Algorithms, vol. 38, pp. 454-456 (1999). DOI: 10.1006/JAGM.2000.1131
Hector Garcia-Molina, Rajeev Motwani, Narayanan Shivakumar, Jeffrey D. Ullman, Min Fang. Computing Iceberg Queries Efficiently. Very Large Data Bases, pp. 299-310 (1998)
E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, C. Yang. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, vol. 13, pp. 64-78 (2001). DOI: 10.1109/69.908981
Dorit S. Hochbaum, David B. Shmoys. A Best Possible Heuristic for the k-Center Problem. Mathematics of Operations Research, vol. 10, pp. 180-184 (1985). DOI: 10.1287/MOOR.10.2.180
Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher. Min-Wise Independent Permutations. Symposium on the Theory of Computing, vol. 60, pp. 630-659 (2000). DOI: 10.1006/JCSS.1999.1690
M.F. Porter. An algorithm for suffix stripping. Program: Electronic Library and Information Systems, vol. 40, pp. 313-316 (1997). DOI: 10.1108/EB046814
Oren Zamir, Oren Etzioni. Web document clustering: a feasibility demonstration. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 46-54 (1998). DOI: 10.1145/290941.290956