Duplicate data elimination system

Authors: Surajit Chaudhuri, Rahul Kapoor, Venkatesh Ganti

DOI:

Keywords: Graph (abstract data type), Tuple, Data records, Computer science, Information retrieval

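Abstract: A process for finding similar data records from a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on the similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The data records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group, a canonical record is identified.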
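The pipeline outlined in the abstract (field-tagged tokens, pairwise similarity scores, a threshold, a graph of records, one canonical record per group) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the record: the Jaccard overlap used as the similarity score, the use of connected components to form groups, the rule that picks the record with the highest total similarity as canonical, and the function names `tokenize`, `similarity`, `group_records`, and `canonical`.

```python
from itertools import combinations


def tokenize(record):
    """Split each attribute value into tokens tagged with their attribute field."""
    return {(field, tok)
            for field, value in record.items()
            for tok in value.lower().split()}


def similarity(a, b):
    """Jaccard overlap of field-tagged tokens (an assumed scoring function)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def group_records(records, threshold):
    """Build a graph whose edges link records scoring above the threshold,
    then return its connected components (assumed grouping rule) and the edges."""
    n = len(records)
    edges = {i: {} for i in range(n)}
    for i, j in combinations(range(n), 2):
        score = similarity(records[i], records[j])
        if score > threshold:
            edges[i][j] = score
            edges[j][i] = score

    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over one component
            node = stack.pop()
            component.append(node)
            for neighbour in edges[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(component))
    return groups, edges


def canonical(group, edges):
    """Pick the record with the highest total similarity to the rest of its
    group (assumed rule); ties go to the lowest index."""
    return max(group,
               key=lambda i: (sum(edges[i].get(j, 0.0) for j in group), -i))
```

Given a list of records represented as dictionaries mapping attribute names to strings, `group_records(records, threshold)` returns the groups as lists of record indices together with the weighted edges, and `canonical(group, edges)` returns the index of the representative record for one group.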

References (24)
Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita, Declarative Data Cleaning: Language, Model, and Algorithms. Very Large Data Bases, pp. 371-380 (2001)
Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti, Eliminating fuzzy duplicates in data warehouses. Very Large Data Bases, pp. 586-597 (2002), 10.1016/B978-155860869-6/50058-5
Erhard Rahm, Hong Hai Do, Data Cleaning: Problems and Current Approaches. IEEE Data(base) Engineering Bulletin, vol. 23, pp. 3-13 (2000)
Y. Huhtala, J. Karkkainen, P. Porkka, H. Toivonen, Efficient discovery of functional and approximate dependencies using partitions. International Conference on Data Engineering, pp. 392-401 (1998), 10.1109/ICDE.1998.655802
H. Galhardas, D. Florescu, D. Shasha, E. Simon, An Extensible Framework for Data Cleaning. International Conference on Data Engineering, p. 312 (2000), 10.1109/ICDE.2000.839429
Salvatore J. Stolfo, Mauricio A. Hernández, Method of merging large databases in parallel (1994)
A. N. Arslan, O. Egecioglu, P. A. Pevzner, A new approach to sequence comparison: normalized sequence alignment. Bioinformatics, vol. 17, pp. 327-337 (2001), 10.1093/BIOINFORMATICS/17.4.327
Mauricio A. Hernández, Salvatore J. Stolfo, The merge/purge problem for large databases. International Conference on Management of Data, vol. 24, pp. 127-138 (1995), 10.1145/223784.223807
Vinayak Borkar, Kaustubh Deshmukh, Sunita Sarawagi, Automatic segmentation of text into structured records. International Conference on Management of Data, vol. 30, pp. 175-186 (2001), 10.1145/375663.375682