Efficient exact set-similarity joins

作者: Raghav Kaushik , Arvind Arasu , Venkatesh Ganti

DOI: 10.5555/1182635.1164206

关键词:

摘要: Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs one from each collection, that have high similarity. Recent work has identified SSJoin as useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our important features: They are exact, i.e., they always produce the correct answer, and carry precise performance guarantees. We believe our first to both features; previous with guarantees only probabilistically approximate. demonstrate effectiveness using thorough experimental evaluation over real-life synthetic sets.

参考文章(24)
Piotr Indyk, Taher H. Haveliwala, Aristides Gionis, Scalable Techniques for Clustering the Web. WebDB (Informal Proceedings). pp. 129- 134 ,(2000)
Piotr Indyk, Aristides Gionis, Rajeev Motwani, Similarity Search in High Dimensions via Hashing very large data bases. pp. 518- 529 ,(1999)
Manikandan Narayanan, Richard M. Karp, Gapped Local Similarity Search with Provable Guarantees workshop on algorithms in bioinformatics. pp. 74- 86 ,(2004) , 10.1007/978-3-540-30219-3_7
Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti, Eliminating fuzzy duplicates in data warehouses very large data bases. pp. 586- 597 ,(2002) , 10.1016/B978-155860869-6/50058-5
r;ribeiro-neto bueza-yates (b), Modern Information Retrieval ,(1999)
Jeffrey F. Naughton, Jignesh M. Patel, Raghav Kaushik, Karthikeyan Ramasamy, Set Containment Joins: The Good, The Bad and The Ugly very large data bases. pp. 351- 362 ,(2000)
Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, On random sampling over joins ACM SIGMOD Record. ,vol. 28, pp. 263- 274 ,(1999) , 10.1145/304181.304206
Krishna Bharat, Andrei Broder, Mirror, mirror on the Web: a study of host pairs with replicated content the web conference. ,vol. 31, pp. 1579- 1590 ,(1999) , 10.1016/S1389-1286(99)00021-3
Sergey Melnik, Hector Garcia-Molina, Adaptive algorithms for set containment joins ACM Transactions on Database Systems. ,vol. 28, pp. 56- 99 ,(2003) , 10.1145/762471.762474