Authors: Uwe Draisbach, Felix Naumann, Sascha Szott, Oliver Wonneberg
DOI: 10.1109/ICDE.2012.20
Keywords: Artificial intelligence, Pattern recognition, Computer science, Sorting, Records management, Similarity measure, Partition (database), Data set, Duplicate detection, Data mining
Abstract: Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records, and (ii) data sets might have a high volume, making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose the Duplicate Count Strategy (DCS), a variation of SNM that uses a varying window size. It is based on the intuition that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. Next to the basic variant of DCS, we also propose and thoroughly evaluate a variant called DCS++, which is provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
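The windowing idea described in the abstract can be illustrated with a minimal Python sketch. The first function is a plain fixed-window SNM. The second grows the window while the ratio of detected duplicates to comparisons stays at or above a threshold. All names here (`similar`, `sorted_neighborhood`, `adaptive_neighborhood`, `phi`, `threshold`) are hypothetical, and the growth rule is a simplified assumption made for illustration. It is not the DCS or DCS++ algorithm as specified in the paper.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Hypothetical similarity measure: normalized string similarity."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def sorted_neighborhood(records, key=lambda r: r, window=3, is_dup=similar):
    """Fixed-window SNM: sort by key, then compare each record only with
    the next (window - 1) records in sort order."""
    ordered = sorted(records, key=key)
    pairs, comparisons = set(), 0
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            comparisons += 1
            if is_dup(rec, other):
                pairs.add((rec, other))
    return pairs, comparisons


def adaptive_neighborhood(records, key=lambda r: r, window=3, phi=0.5,
                          is_dup=similar):
    """Simplified adaptive window (an assumption, not the paper's DCS):
    whenever the current window is exhausted and duplicates / comparisons
    is at least phi, extend the window by one record."""
    ordered = sorted(records, key=key)
    pairs, comparisons = set(), 0
    for i, rec in enumerate(ordered):
        j, dups, comps = i + 1, 0, 0
        limit = i + window  # exclusive end index of the current window
        while j < min(limit, len(ordered)):
            comps += 1
            if is_dup(rec, ordered[j]):
                dups += 1
                pairs.add((rec, ordered[j]))
            j += 1
            if j == limit and dups / comps >= phi:
                limit += 1  # region looks duplicate-rich: widen the window
        comparisons += comps
    return pairs, comparisons
```

Both functions return the set of detected duplicate pairs together with the number of comparisons performed, so the two strategies can be compared on the same input. The default `similar` function is only a stand-in; any pairwise similarity measure can be passed as `is_dup`.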