Adaptive Windows for Duplicate Detection

Authors: Uwe Draisbach, Felix Naumann, Sascha Szott, Oliver Wonneberg

DOI: 10.1109/ICDE.2012.20

Keywords: Artificial intelligence, Pattern recognition, Computer science, Sorting, Records management, Similarity measure, Partition (database), Data set, Duplicate detection, Data mining

Abstract: Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity, respectively. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records, and (ii) data sets might have a high volume, making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose with the Duplicate Count Strategy (DCS) a variation of SNM that uses a varying window size. It is based on the intuition that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. Next to the basic variant of DCS, we also propose and thoroughly evaluate a variant called DCS++, which is provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
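The windowing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no growth rule, so the threshold `phi`, the one-record extension step, and all function and parameter names here are illustrative assumptions. `sorted_neighborhood` uses a fixed window; `adaptive_window` extends the window while the share of duplicates found for the current record stays at or above `phi`. DCS++ is not covered.

```python
from typing import Callable, Sequence, Set, Tuple

Pair = Tuple[int, int]


def _sorted_order(records: Sequence[str], key: Callable[[str], str]) -> list:
    # Indices of the records, sorted by the chosen sorting key.
    return sorted(range(len(records)), key=lambda i: key(records[i]))


def sorted_neighborhood(records, key, is_duplicate, window=3):
    """Fixed-window SNM. Returns (duplicate index pairs, comparison count)."""
    order = _sorted_order(records, key)
    found: Set[Pair] = set()
    comparisons = 0
    for pos, i in enumerate(order):
        # Compare record i with the next (window - 1) records in sort order.
        for j in order[pos + 1 : pos + window]:
            comparisons += 1
            if is_duplicate(records[i], records[j]):
                found.add((min(i, j), max(i, j)))
    return found, comparisons


def adaptive_window(records, key, is_duplicate, window=3, phi=0.5):
    """Varying-window sketch (assumed rule): grow the window by one record
    while duplicates / comparisons for the current record is >= phi."""
    order = _sorted_order(records, key)
    n = len(order)
    found: Set[Pair] = set()
    comparisons = 0
    for pos, i in enumerate(order):
        end = min(pos + window, n)  # exclusive end of the current window
        nxt = pos + 1
        dups = 0
        done = 0
        while nxt < end:
            j = order[nxt]
            comparisons += 1
            done += 1
            if is_duplicate(records[i], records[j]):
                dups += 1
                found.add((min(i, j), max(i, j)))
            nxt += 1
            # Window exhausted: extend it if this region looks duplicate-rich.
            if nxt == end and end < n and dups / done >= phi:
                end += 1
    return found, comparisons


if __name__ == "__main__":
    names = ["anna", "zed", "anne", "bob", "annx", "anny"]
    same_prefix = lambda a, b: a[:3] == b[:3]  # toy similarity measure
    print(sorted_neighborhood(names, lambda r: r, same_prefix, window=2))
    print(adaptive_window(names, lambda r: r, same_prefix, window=2, phi=0.5))
```

On data with a dense cluster of similar records, the fixed window of size 2 only sees adjacent pairs, while the adaptive variant keeps extending the window across the cluster and stops growing once it reaches dissimilar records.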
