Leveraging constraints for deduplication

Authors: Anish Das Sarma, Surajit Chaudhuri, Shriraghav Kaushik, Venkatesh Ganti

Keywords: SQL, Data deduplication, Data records, Theoretical computer science, Partition (database), restrict, Constraint satisfaction, Tuple, Data mining, Mathematics

Abstract: A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication is performed so that as many of these constraints as possible are satisfied, rather than imposing them inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of the data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, the algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of aggregation constraints for deduplication, all SQL (structured query language) aggregates are allowed, including summation.
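
The threshold-raising search described in the abstract might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the token-set Jaccard similarity, the threshold values, the sample records, and the "amounts in a group sum to 100" aggregate constraint are all hypothetical, as are the function names. Groups are connected components of the similarity graph at a given threshold; each group is either kept or replaced by the pieces it splits into at the next higher threshold, whichever satisfies more constraints.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed stand-in for textual similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def components(ids, sim, threshold):
    """Connected components of the graph linking pairs with similarity >= threshold."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(ids, 2):  # ids are sorted, so i < j matches sim keys
        if sim[i, j] >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def best_partition(group, sim, thresholds, satisfied):
    """Return (score, partition) maximizing the number of groups that satisfy the constraint."""
    keep = (1 if satisfied(group) else 0, [group])
    for k, t in enumerate(thresholds):
        parts = components(group, sim, t)
        if len(parts) > 1:  # first threshold that splits this group
            results = [best_partition(p, sim, thresholds[k + 1:], satisfied) for p in parts]
            split = (sum(s for s, _ in results), [g for _, ps in results for g in ps])
            # On a tie, prefer the coarser alternative (an arbitrary choice here).
            return split if split[0] > keep[0] else keep
    return keep


def deduplicate(records, thresholds, satisfied):
    """Start from the coarse partition at thresholds[0], then refine group by group."""
    ids = list(range(len(records)))
    sim = {(i, j): jaccard(records[i][0], records[j][0]) for i, j in combinations(ids, 2)}
    results = [best_partition(g, sim, thresholds[1:], satisfied)
               for g in components(ids, sim, thresholds[0])]
    return sum(s for s, _ in results), [g for _, ps in results for g in ps]


# Hypothetical data: (text, amount); assume each true entity's amounts sum to 100.
records = [
    ("Acme Corp Seattle", 60),
    ("Acme Corp Seattle WA", 40),
    ("Acme Corp Portland", 70),
    ("Acme Corp Portland OR", 30),
]
score, partition = deduplicate(
    records,
    thresholds=[0.3, 0.6, 0.8],
    satisfied=lambda g: sum(records[i][1] for i in g) == 100,
)
print(score, partition)  # 2 [[0, 1], [2, 3]]
```

In this toy run the coarse partition at threshold 0.3 lumps all four records together (sum 200, constraint violated). Raising the threshold to 0.6 splits it into two groups that each satisfy the constraint, and the further split at 0.8 is rejected because it would satisfy none.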
