Column heterogeneity as a measure of data quality

Authors: Nick Koudas, Beng Chin Ooi, Suresh Venkatasubramanian, Divesh Srivastava, Bing Tian Dai

Keywords: Data management, Computer science, Data mining, Fuzzy clustering, Entropy (information theory), Column (database), Data quality

Abstract: Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, including accuracy, freshness and completeness, to capture the common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and discuss a promising direction of research to quantify database column heterogeneity based on using a novel combination of cluster entropy and soft clustering. Finally, we present a few preliminary experimental results, using diverse data sets of semantically different types, to demonstrate that this approach appears to provide a robust mechanism for identifying and quantifying database column heterogeneity.
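
To make the idea of cluster entropy over a soft clustering concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' method: the function name `cluster_entropy` and the toy membership matrices are hypothetical, and the soft clustering is assumed to be already computed, whereas the paper derives it from the column's values. Each row gives one column value's membership probabilities over the clusters. The entropy of the resulting cluster mass distribution is low when nearly all values fall in one cluster and higher when the mass is spread across several.

```python
import math


def cluster_entropy(memberships):
    """Entropy (in bits) of the cluster mass distribution.

    memberships: list of rows, one per column value. Each row holds that
    value's soft membership probabilities over the clusters and is
    assumed to sum to 1 (hypothetical input format).
    """
    n = len(memberships)
    k = len(memberships[0])
    # Mass of each cluster: the average membership over all values.
    mass = [sum(row[j] for row in memberships) / n for j in range(k)]
    # Shannon entropy; empty clusters contribute nothing.
    return -sum(p * math.log2(p) for p in mass if p > 0)


# Toy data (hypothetical): four values, two clusters.
homogeneous = [[1.0, 0.0]] * 4               # every value in cluster 0
mixed = [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 2  # values split evenly

low = cluster_entropy(homogeneous)
high = cluster_entropy(mixed)
```

In this sketch a column whose values all belong to one cluster scores lower than a column split evenly between two clusters, which is the qualitative behaviour a heterogeneity measure is expected to show.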

References (8)
Jennifer Widom. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. Conference on Innovative Data Systems Research, pp. 262-276 (2004).
Tamraparni Dasu, Theodore Johnson. Exploratory Data Mining and Data Cleaning (2003).
Louiqa Raschid, Maria-Esther Vidal, George A. Mihaila. Querying Quality of Data Metadata. MD (1998).
Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk. Mining database structure; or, how to build a data quality browser. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD '02), pp. 240-251 (2002). 10.1145/564691.564719
Theodore Johnson, Tamraparni Dasu. Data quality and data cleaning: an overview. International Conference on Management of Data, p. 681 (2003). 10.1145/872757.872875
Thomas M. Cover, Joy A. Thomas. Elements of Information Theory (1991).
Veijo Notkola, Harri Siiskonen. Quality of Data. In Fertility, Mortality and Migration in SubSaharan Africa, pp. 59-67 (2000). 10.1057/9780333981344_7
Naftali Tishby, Fernando C. N. Pereira, William Bialek. The information bottleneck method. Proc. 37th Annual Allerton Conference on Communications, Control and Computing, 1999, pp. 368-377 (2000).