Efficient exact set-similarity joins

作者： Raghav Kaushik , Arvind Arasu , Venkatesh Ganti

关键词:

摘要: Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs one from each collection, that have high similarity. Recent work has identified SSJoin as useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our important features: They are exact, i.e., they always produce the correct answer, and carry precise performance guarantees. We believe our first to both features; previous with guarantees only probabilistically approximate. demonstrate effectiveness using thorough experimental evaluation over real-life synthetic sets.

freepatentsonline.com 本地加速

vldb.org PDF 下载加速

microsoft.com PDF 下载加速

acm.org LINK 下载加速

freepatentsonline.com LINK 下载加速

lens.org UNKNOWN 下载加速

freepatentsonline.com UNKNOWN 下载加速

sci-hub.st HTML 下载加速

参考文章(24)

Piotr Indyk, Taher H. Haveliwala, Aristides Gionis, Scalable Techniques for Clustering the Web. WebDB (Informal Proceedings). pp. 129- 134 ,(2000)

Hyoseop Shin, Partition-based high dimensional similarity join method ,(2003)

Piotr Indyk, Aristides Gionis, Rajeev Motwani, Similarity Search in High Dimensions via Hashing very large data bases. pp. 518- 529 ,(1999)

Manikandan Narayanan, Richard M. Karp, Gapped Local Similarity Search with Provable Guarantees workshop on algorithms in bioinformatics. pp. 74- 86 ,(2004) , 10.1007/978-3-540-30219-3_7

Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti, Eliminating fuzzy duplicates in data warehouses very large data bases. pp. 586- 597 ,(2002) , 10.1016/B978-155860869-6/50058-5

r;ribeiro-neto bueza-yates (b), Modern Information Retrieval ,(1999)

Jeffrey F. Naughton, Jignesh M. Patel, Raghav Kaushik, Karthikeyan Ramasamy, Set Containment Joins: The Good, The Bad and The Ugly very large data bases. pp. 351- 362 ,(2000)

Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, On random sampling over joins ACM SIGMOD Record. ,vol. 28, pp. 263- 274 ,(1999) , 10.1145/304181.304206

Krishna Bharat, Andrei Broder, Mirror, mirror on the Web: a study of host pairs with replicated content the web conference. ,vol. 31, pp. 1579- 1590 ,(1999) , 10.1016/S1389-1286(99)00021-3

10.

Sergey Melnik, Hector Garcia-Molina, Adaptive algorithms for set containment joins ACM Transactions on Database Systems. ,vol. 28, pp. 56- 99 ,(2003) , 10.1145/762471.762474

Efficient exact set-similarity joins

来源期刊

我的账户

Efficient exact set-similarity joins

来源期刊

相似文章 10

我的账户