Declarative Data Cleaning: Language, Model, and Algorithms

Authors: Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita

Abstract: The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for non-conventional applications, such as the migration of largely unstructured data into structured data, or the integration of heterogeneous scientific data sets in inter-disciplinary fields (e.g., environmental science), existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge with them is the design of a data flow graph that effectively generates clean data and can perform efficiently on large sets of input data. The difficulty comes from (i) a lack of clear separation between the logical specification of data transformations and their physical implementation and (ii) the lack of explanation of cleaning results and of user interaction facilities to tune a data cleaning program. This paper addresses these two problems and presents a language, an execution model, and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessment of the proposed framework for data cleaning.
