Adaptive Windows for Duplicate Detection

Authors: Uwe Draisbach, Felix Naumann, Sascha Szott, Oliver Wonneberg

DOI: 10.1109/ICDE.2012.20

Keywords: Artificial intelligence, Pattern recognition, Computer science, Sorting, Records management, Similarity measure, Partition (database), Data set, Duplicate detection, Data mining

Abstract: Duplicate detection is the task of identifying all groups of records within a data set that each represent the same real-world entity. This task is difficult because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records, and (ii) data sets might have a high volume, making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose the Duplicate Count Strategy (DCS), a variation of SNM that uses a varying window size. It is based on the intuition that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. Next to the basic variant of DCS, we also propose and thoroughly evaluate a variant called DCS++, which is provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
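The windowing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy `same_prefix` similarity measure, the threshold `phi`, and the rule "grow the window by one record while duplicates found / comparisons made is at least `phi`" are all assumptions chosen for illustration, since the abstract does not spell out the exact adaptation rule. The skipping of windows that makes DCS++ more efficient is not modelled.

```python
def same_prefix(a, b, n=4):
    """Toy similarity measure (assumption): equal first n characters."""
    return a[:n] == b[:n]


def snm_pairs(records, window=2, key=lambda r: r, is_dup=same_prefix):
    """Fixed-window Sorted Neighborhood sketch: sort by key, then compare
    each record with the next (window - 1) records in sort order."""
    data = sorted(records, key=key)
    found, comparisons = set(), 0
    for i in range(len(data)):
        for j in range(i + 1, min(i + window, len(data))):
            comparisons += 1
            if is_dup(data[i], data[j]):
                found.add((data[i], data[j]))
    return found, comparisons


def adaptive_pairs(records, window=2, phi=0.5, key=lambda r: r,
                   is_dup=same_prefix):
    """Adaptive-window sketch: start from the same window, but extend it by
    one record whenever the window is exhausted and the ratio
    duplicates / comparisons for the current record is at least phi
    (hypothetical growth rule, see the lead-in)."""
    data = sorted(records, key=key)
    found, comparisons = set(), 0
    for i in range(len(data)):
        end = min(i + window, len(data))  # exclusive end of current window
        j = i + 1
        dups = comps = 0
        while j < end:
            comps += 1
            comparisons += 1
            if is_dup(data[i], data[j]):
                dups += 1
                found.add((data[i], data[j]))
            j += 1
            # Region of high similarity: enlarge the window by one record.
            if j == end and end < len(data) and dups / comps >= phi:
                end += 1
    return found, comparisons


records = ["smith j", "smith jo", "smith john", "smyth j", "taylor", "adams"]
fixed, fixed_cost = snm_pairs(records, window=2)
adaptive, adaptive_cost = adaptive_pairs(records, window=2, phi=0.5)
print(sorted(fixed), fixed_cost)
print(sorted(adaptive), adaptive_cost)
```

With a window of two records, the fixed-window pass never compares "smith j" with "smith john", because they are not adjacent after sorting. The adaptive pass reaches that pair by growing the window inside the cluster of similar records, while staying narrow around "adams" and "taylor". In this simple form the extra pair costs extra comparisons; the efficiency claim in the abstract concerns DCS++, which this sketch does not implement.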
