Robust record linkage blocking using suffix arrays

Authors: Timothy de Vries, Hui Ke, Sanjay Chawla, Peter Christen

DOI: 10.1145/1645953.1645994

Keywords: Search engine indexing, Computer science, Data structure, Compressed suffix array, Suffix, Data mining, Suffix array

Abstract: Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, the quadratic scalability of the brute force approach necessitates the design of appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlights the importance of efficient blocking in real-world applications where data sets contain millions of records.
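As a rough illustration of the idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes each record has a single blocking key string, builds an inverted index from key suffixes to record ids, then walks the suffixes in sorted order and merges the blocks of neighbouring suffixes that look alike (a sliding window of size two). The names `suffixes`, `suffix_blocks`, `min_len`, `max_block` and `threshold` are hypothetical, and `difflib.SequenceMatcher` stands in for whatever string similarity measure the paper actually uses.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def suffixes(key, min_len):
    """Return all suffixes of key that are at least min_len characters long."""
    return [key[i:] for i in range(0, len(key) - min_len + 1)]


def suffix_blocks(records, min_len=3, max_block=10, threshold=0.8):
    """Group record ids into candidate blocks.

    records: dict mapping record id -> blocking key string (assumed input shape).
    threshold: similarity cut-off for merging neighbouring suffixes (assumed).
    """
    # Inverted index: suffix -> set of record ids whose key has that suffix.
    index = defaultdict(set)
    for rec_id, key in records.items():
        for s in suffixes(key, min_len):
            index[s].add(rec_id)

    # Walk the suffixes in sorted order. A suffix joins the current group
    # when it is similar enough to the suffix just before it.
    blocks = []
    prev = None
    for s in sorted(index):
        if prev is not None and SequenceMatcher(None, prev, s).ratio() >= threshold:
            blocks[-1] |= index[s]
        else:
            blocks.append(set(index[s]))
        prev = s

    # Keep only blocks that can yield a pair and are not overly large.
    return [b for b in blocks if 2 <= len(b) <= max_block]


if __name__ == "__main__":
    keys = {1: "smith", 2: "smyth", 3: "jones"}
    print(suffix_blocks(keys, threshold=0.7))
```

With exact suffix matching alone, "smith" and "smyth" share no suffix of length three or more and would never be compared. Merging neighbouring suffixes in sorted order is what places them in a common block here. The same pair can appear in more than one block, so a real pipeline would deduplicate candidate pairs before comparison.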

References (17)
Howard B. Newcombe, James M. Kennedy. Record linkage: making maximum use of the discriminating power of identifying information. Communications of the ACM, vol. 5, pp. 563-566 (1962). DOI: 10.1145/368996.369026
M.G. Elfeky, V.S. Verykios, A.K. Elmagarmid. TAILOR: a record linkage toolbox. International Conference on Data Engineering, pp. 17-28 (2002). DOI: 10.1109/ICDE.2002.994694
Arvind Arasu, Christopher Ré, Dan Suciu. Large-Scale Deduplication with Constraints Using Dedupalog. 2009 IEEE 25th International Conference on Data Engineering, pp. 952-963 (2009). DOI: 10.1109/ICDE.2009.43
Su Yan, Dongwon Lee, Min-Yen Kan, C. Lee Giles. Adaptive sorted neighborhood methods for efficient record linkage. Proceedings of the 2007 conference on Digital libraries - JCDL '07, pp. 185-194 (2007). DOI: 10.1145/1255175.1255213
Peter Christen, Tim Churches, Rohan Baxter. A Comparison of Fast Blocking Methods for Record Linkage. Knowledge Discovery and Data Mining (2003).
J. T. Marshall. Canada's national vital statistics index. Population Studies: A Journal of Demography, vol. 1, pp. 204-211 (1947). DOI: 10.1080/00324728.1947.10415531
Lifang Gu, Deanne Vickers, Chris Rainsford, Rohan Baxter. Record Linkage: Current Practice and Future Directions (2003).
Peter Christen. Towards Parameter-free Blocking for Scalable Record Linkage. Canberra, ACT: Dept. of Computer Science, Faculty of Engineering and Information Technology, The Australian National University (2007).
Mauricio A. Hernández, Salvatore J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery, vol. 2, pp. 9-37 (1998). DOI: 10.1023/A:1009761603038
Lian'en Huang, Lei Wang, Xiaoming Li. Achieving both high precision and high recall in near-duplicate detection. Proceeding of the 17th ACM conference on Information and knowledge mining - CIKM '08, pp. 63-72 (2008). DOI: 10.1145/1458082.1458094