CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection

Authors: Hoang Vu Nguyen, Emmanuel Müller, Jilles Vreeken, Fabian Keller, Klemens Böhm

DOI: 10.1137/1.9781611972832.22

Abstract: In many real world applications data is collected in multi-dimensional spaces, with the knowledge hidden in subspaces (i.e., subsets of the dimensions). It is an open research issue to select meaningful subspaces without any prior knowledge about such hidden patterns. Standard approaches, such as pairwise correlation measures, or statistical approaches based on entropy, do not solve this problem; due to their restrictive pairwise analysis and loss of information in discretization they are bound to miss subspaces with potential clusters and outliers. In this paper, we focus on finding subspaces with strong mutual dependency in the selected dimension set. Chosen subspaces should provide a high discrepancy between clusters and outliers and enhance detection of these patterns. To measure this, we propose a novel contrast score that quantifies mutual correlations in subspaces by considering their cumulative distributions, without having to discretize the data. In our experiments, we show that these high contrast subspaces provide enhanced quality in cluster and outlier detection for both synthetic and real world data.

References (28)
Pritam Chanda, Murali Ramanathan, Aidong Zhang, Jianmei Yang. On Mining Statistically Significant Attribute Association Information. SIAM International Conference on Data Mining, pp. 141-152 (2010).
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek. Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data. Advances in Knowledge Discovery and Data Mining, pp. 831-838 (2009). DOI: 10.1007/978-3-642-01307-2_86
K. Sequeira, M. Zaki. SCHISM: A New Approach for Interesting Subspace Mining. International Conference on Data Mining, pp. 186-193 (2004). DOI: 10.1109/ICDM.2004.10099
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft. When Is "Nearest Neighbor" Meaningful? International Conference on Database Theory, pp. 217-235 (1999). DOI: 10.1007/3-540-49257-7_15
Hans-Peter Kriegel, Martin Ester, Jörg Sander, Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Knowledge Discovery and Data Mining, pp. 226-231 (1996).
John A. Lee, Michel Verleysen. Nonlinear Dimensionality Reduction (2007).
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD '98), vol. 27, pp. 94-105 (1998). DOI: 10.1145/276304.276314
Fabian Keller, Emmanuel Müller, Klemens Böhm. HiCS: High Contrast Subspaces for Density-Based Outlier Ranking. 2012 IEEE 28th International Conference on Data Engineering, pp. 1037-1048 (2012). DOI: 10.1109/ICDE.2012.88
Chun-Hung Cheng, Ada Waichee Fu, Yi Zhang. Entropy-Based Subspace Clustering for Mining Numerical Data. Knowledge Discovery and Data Mining, pp. 84-93 (1999). DOI: 10.1145/312129.312199
Aleksandar Lazarevic, Vipin Kumar. Feature Bagging for Outlier Detection. Knowledge Discovery and Data Mining, pp. 157-166 (2005). DOI: 10.1145/1081870.1081891