Authors: Nick Koudas, Beng Chin Ooi, Suresh Venkatasubramanian, Divesh Srivastava, Bing Tian Dai
DOI:
Keywords: Data management, Computer science, Data mining, Fuzzy clustering, Entropy (information theory), Column (database), Data quality
Abstract: Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, including accuracy, freshness, and completeness, to capture the common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and discuss a promising direction of research to quantify database column heterogeneity based on using a novel combination of cluster entropy and soft clustering. Finally, we present a few preliminary experimental results, using diverse data sets of semantically different types, to demonstrate that this approach appears to provide a robust mechanism for identifying and quantifying database column heterogeneity.
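
To make the idea of cluster entropy over a soft clustering concrete, here is a minimal illustrative sketch. It is not the authors' implementation: the function name `cluster_entropy`, the input layout (one row of cluster membership probabilities per column value), and the choice of base-2 entropy of the averaged memberships are all assumptions made for this example. The abstract does not specify how the soft clustering itself is computed, so the memberships are simply taken as given.

```python
import math


def cluster_entropy(memberships):
    """Entropy (in bits) of the cluster marginal implied by soft memberships.

    memberships: list of rows, one per column value. Each row is assumed to be
    a probability distribution over clusters (non-negative, summing to 1).
    This layout is a hypothetical choice for illustration only.
    """
    n = len(memberships)
    k = len(memberships[0])
    # Marginal cluster mass: average the soft assignments over all values.
    marginal = [sum(row[j] for row in memberships) / n for j in range(k)]
    # Shannon entropy of the marginal; empty clusters contribute nothing.
    return sum(p * math.log2(1.0 / p) for p in marginal if p > 0)


# Every value sits in one cluster: the column looks homogeneous.
homogeneous = [[1.0, 0.0]] * 4
# Values split evenly across two clusters: the column looks mixed.
mixed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

print(cluster_entropy(homogeneous))
print(cluster_entropy(mixed))
```

Under these assumptions, a column whose values all fall into one cluster scores zero, while a column split evenly between two clusters scores one bit, so a higher score suggests a more heterogeneous column.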