Authors: Anish Das Sarma, Surajit Chaudhuri, Shriraghav Kaushik, Venkatesh Ganti
DOI:
Keywords: SQL, Data deduplication, Data records, Theoretical computer science, Partition (database), restrict, Constraint satisfaction, Tuple, Data mining, Mathematics
Abstract: A deduplication algorithm that improves the accuracy of data deduplication by using aggregate and/or groupwise constraints. Deduplication is performed so that as many of these constraints as possible are satisfied, rather than imposing them inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of the data records and continues by raising the similarity threshold until it splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds the partition of the input that maximizes constraint satisfaction. In the context of aggregation constraints, all SQL (Structured Query Language) aggregates are allowed, including summation.
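The threshold-raising search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token-set Jaccard similarity, the choice of a single all-records group as the coarse initial partition, the scoring rule (number of records placed in groups that satisfy the constraint), and all function names are assumptions made for illustration. The constraint used here is a summation aggregate: the `amount` values in each group should sum to 100.

```python
from itertools import combinations

def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed stand-in for textual similarity)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def components(group, sim, threshold):
    """Connected components of `group`, linking records whose similarity > threshold."""
    remaining, parts = set(group), []
    while remaining:
        stack = [remaining.pop()]
        comp = set(stack)
        while stack:
            u = stack.pop()
            linked = {v for v in remaining if sim[frozenset((u, v))] > threshold}
            remaining -= linked
            comp |= linked
            stack.extend(linked)
        parts.append(sorted(comp))
    return sorted(parts)

def split_group(group, sim):
    """Raise the threshold through the observed similarities until the group splits."""
    for t in sorted({sim[frozenset(p)] for p in combinations(group, 2)}):
        parts = components(group, sim, t)
        if len(parts) > 1:
            return parts
    return [group]  # only reached for single-record groups

def best_partition(group, sim, satisfied):
    """Return (score, partition); score counts records in constraint-satisfying groups.

    Hypothetical objective: keep a group whole if it satisfies the constraint,
    otherwise split it and recurse, falling back to the coarse group if no
    descendant satisfies the constraint either.
    """
    if satisfied(group):
        return len(group), [group]
    if len(group) == 1:
        return 0, [group]
    score, parts = 0, []
    for child in split_group(group, sim):
        s, p = best_partition(child, sim, satisfied)
        score += s
        parts += p
    return (score, parts) if score > 0 else (0, [group])

# Illustrative data: (text, amount) pairs, invented for this example.
records = [
    ("acme corp seattle", 60),
    ("acme corp seattle hq", 40),
    ("acme inc tacoma", 30),
    ("acme inc tacoma branch", 70),
]
sim = {frozenset((i, j)): jaccard(records[i][0], records[j][0])
       for i, j in combinations(range(len(records)), 2)}

def sums_to_100(group):
    """Assumed aggregate constraint: SUM(amount) over the group equals 100."""
    return sum(records[i][1] for i in group) == 100

score, partition = best_partition(list(range(len(records))), sim, sums_to_100)
print(partition)  # [[0, 1], [2, 3]]
```

In this example the single coarse group sums to 200 and violates the constraint, so the threshold is raised until the group splits into two subgroups that each sum to 100. Treating the constraint as a soft preference means the search still returns a partition when no split satisfies it.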