Database Similarity Join for Metric Spaces

Authors: Yasin N. Silva, Spencer S. Pearson, Jason A. Cheney

DOI: 10.1007/978-3-642-41062-8_27

Keywords: Joins, Data mining, Data processing, Database, Information retrieval, Sort-merge join, Mathematics, Operator (computer programming), Metric space, Hash join, Similarity (network science), Join (sigma algebra)

Abstract: Similarity Joins are recognized among the most useful data processing and analysis operations. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. While several standalone implementations have been proposed, very little work has addressed the implementation of Similarity Joins as physical database operators. In this paper, we focus on the study, design, and implementation of a Similarity Join database operator for any dataset that lies in a metric space (DBSimJoin). We describe the changes in each query engine module to implement DBSimJoin and provide details of our implementation in PostgreSQL. The extensive performance evaluation shows that DBSimJoin significantly outperforms alternative approaches.
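
To make the operation in the abstract concrete, the following is a minimal sketch of a similarity join as a naive nested loop: it returns every pair whose distance falls below the threshold. It is not the DBSimJoin operator and says nothing about how the paper implements it inside PostgreSQL; the function name `similarity_join`, its parameters, and the sample points are assumptions made for illustration only. Euclidean distance is used as the default, but any function satisfying the metric-space axioms could be passed instead.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def similarity_join(r, s, eps, distance=dist):
    """Return all pairs (a, b), a in r and b in s, with distance(a, b) < eps.

    Hypothetical helper for illustration: a quadratic nested-loop baseline,
    not the algorithm described in the paper.
    """
    return [(a, b) for a, b in product(r, s) if distance(a, b) < eps]


# Illustrative data (assumed, not taken from the paper).
r = [(0.0, 0.0), (5.0, 5.0)]
s = [(0.5, 0.0), (5.0, 9.0), (4.0, 5.0)]

pairs = similarity_join(r, s, eps=1.5)
for a, b in pairs:
    print(a, b)
```

The baseline compares every element of one input with every element of the other, which is the cost that a dedicated physical operator would aim to avoid.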
