Authors: Timothy de Vries, Hui Ke, Sanjay Chawla, Peter Christen
Keywords: Search engine indexing, Computer science, Data structure, Compressed suffix array, Suffix, Data mining, Suffix array
Abstract: Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, the quadratic scalability of the brute force approach necessitates the design of appropriate indexing or blocking techniques. We evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in much higher accuracy while retaining the high scalability of the base suffix array method. Efficient grouping of similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlights the importance of efficient indexing and blocking in real-world applications where data sets contain millions of records.
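
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no parameters or similarity measure, so the minimum suffix length, the maximum block size, the window of two adjacent suffixes, the 0.8 threshold, the use of `difflib.SequenceMatcher`, and all function names below are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def suffixes(key, min_len):
    """Return every suffix of `key` with at least `min_len` characters."""
    return [key[i:] for i in range(len(key) - min_len + 1)]


def build_index(records, min_len=4, max_block_size=10):
    """Inverted index: suffix -> ids of records whose key has that suffix.

    `min_len` and `max_block_size` are illustrative values, not from the source.
    """
    index = {}
    for rec_id, key in records.items():
        for suf in suffixes(key, min_len):
            index.setdefault(suf, set()).add(rec_id)
    # Very common suffixes produce oversized blocks, so they are dropped.
    return {s: ids for s, ids in index.items() if len(ids) <= max_block_size}


def group_similar(index, threshold=0.8):
    """Slide a window of two over the sorted suffixes and merge the blocks
    of neighbouring suffixes whose similarity reaches `threshold`."""
    ordered = sorted(index)
    if not ordered:
        return []
    groups = []
    current = set(index[ordered[0]])
    for prev, cur in zip(ordered, ordered[1:]):
        if SequenceMatcher(None, prev, cur).ratio() >= threshold:
            current |= index[cur]          # similar neighbour: same block
        else:
            groups.append(current)         # dissimilar: close the block
            current = set(index[cur])
    groups.append(current)
    return groups


def candidate_pairs(blocks):
    """All record pairs that share at least one block."""
    pairs = set()
    for block in blocks:
        pairs.update(combinations(sorted(block), 2))
    return pairs


# Hypothetical blocking key values; record 3 differs only in its last letter.
records = {1: "catherine", 2: "katherine", 3: "catherina", 4: "smith"}
index = build_index(records)
plain_pairs = candidate_pairs(index.values())
grouped_pairs = candidate_pairs(group_similar(index))
print(sorted(plain_pairs))
print(sorted(grouped_pairs))
```

In this toy data, records 1 and 2 share exact suffixes such as "atherine", so plain suffix blocking already pairs them. Record 3 shares no exact suffix with either, and it becomes a candidate only after neighbouring suffixes in the sorted order (for example "atherina" and "atherine") are merged. Because the suffixes are already sorted for the index, the grouping pass adds only one comparison per adjacent pair.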