Probabilistic techniques for detecting duplicate tuples

Authors: Ying Xu, Venkatesh Ganti

Abstract: A technique for probabilistically determining fuzzy duplicates includes converting a plurality of tuples into hash vectors utilizing a locality sensitive hashing algorithm. The hash vectors are sorted, on one or more vector coordinates, to cluster similar hash vector coordinate values together. Each cluster of two or more hash vectors identifies candidate tuples. The candidate tuples are compared utilizing a similarity function. Tuples which are more similar than a specified threshold are returned.
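
The steps in the abstract can be illustrated with a minimal sketch. It assumes min-hashing over word tokens as the locality sensitive hash and Jaccard similarity as the similarity function; the abstract names neither, and all function names and parameters here (`tokens`, `minhash_vector`, `fuzzy_duplicates`, `num_hashes`, `threshold`) are hypothetical, chosen only for illustration.

```python
import hashlib
from itertools import combinations


def tokens(tup):
    """Lowercased word tokens drawn from every field of a tuple."""
    return {w for field in tup for w in str(field).lower().split()}


def _h(seed, token):
    # Deterministic 64-bit hash of (seed, token). hashlib is used because
    # Python's built-in hash() of strings is randomized per process.
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_vector(token_set, num_hashes=8):
    """Hash vector: one min-hash coordinate per seeded hash function (assumed LSH)."""
    return tuple(
        min((_h(seed, t) for t in token_set), default=0)
        for seed in range(num_hashes)
    )


def jaccard(a, b):
    """Assumed similarity function: Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def fuzzy_duplicates(tuples, threshold=0.5, num_hashes=8):
    """Return index pairs (i, j) of tuples more similar than `threshold`."""
    token_sets = [tokens(t) for t in tuples]
    vectors = [minhash_vector(s, num_hashes) for s in token_sets]

    candidates = set()
    for coord in range(num_hashes):
        # Sort on one vector coordinate so equal values become adjacent.
        order = sorted(range(len(tuples)), key=lambda i: vectors[i][coord])
        start = 0
        for end in range(1, len(order) + 1):
            run_ended = (
                end == len(order)
                or vectors[order[end]][coord] != vectors[order[start]][coord]
            )
            if run_ended:
                # A run of two or more equal values is a cluster of candidates.
                candidates.update(combinations(sorted(order[start:end]), 2))
                start = end

    # Verify each candidate pair with the similarity function.
    return sorted(
        (i, j)
        for i, j in candidates
        if jaccard(token_sets[i], token_sets[j]) > threshold
    )
```

In this sketch, exact duplicates always share every coordinate and are therefore always found. Near duplicates are found only probabilistically: two tuples with Jaccard similarity s agree on any single min-hash coordinate with probability s, so raising `num_hashes` raises the chance that a similar pair lands in at least one shared cluster.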

References (1)
Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna, Detecting duplicate records in databases (2005)