Authors: Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita
DOI:
Keywords:
Abstract: The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for non-conventional applications, such as the migration of largely unstructured data into structured data, or the integration of heterogeneous scientific data sets in interdisciplinary fields (e.g., environmental science), existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge with them is the design of a data flow graph that effectively generates clean data and can perform efficiently on large sets of input data. The difficulty comes from (i) a lack of clear separation between the logical specification of data transformations and their physical implementation, and (ii) the lack of explanation of cleaning results and of user interaction facilities to tune a data cleaning program. This paper addresses these two problems and presents a language, an execution model, and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessment of the proposed framework for data cleaning.