Clustering by pattern similarity in large data sets

Authors: Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu

DOI: 10.1145/564691.564737

Keywords: Clustering high-dimensional data, Closeness, Fuzzy clustering, Artificial intelligence, Data mining, Similarity (network science), Euclidean distance, Computer science, Pattern recognition, Collaborative filtering, Set (abstract data type), Data set, Cluster analysis

Abstract: Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, similar objects are required to have close values on at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we proposed, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our paper introduces an effective algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its effectiveness.
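
The contrast the abstract draws between distance-based and pattern-based similarity can be illustrated with a minimal sketch. This is not the paper's detection algorithm. It only checks whether two objects are coherent on a given subset of dimensions, using the absolute difference of differences on each pair of dimensions as the coherence measure (the form in which the pScore is usually described). The abstract itself does not define that measure, so treat it, the function names, the sample values, and the threshold `delta` as assumptions made for illustration.

```python
from itertools import combinations
import math


def euclidean(x, y):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pscore(x, y, a, b):
    """Coherence of objects x and y on dimensions a and b.

    Assumed form: |(x_a - x_b) - (y_a - y_b)|. A value of 0 means the two
    objects change by exactly the same amount between the two dimensions.
    """
    return abs((x[a] - x[b]) - (y[a] - y[b]))


def is_coherent(x, y, dims, delta):
    """True if every pair of dimensions in dims has pscore <= delta."""
    return all(pscore(x, y, a, b) <= delta for a, b in combinations(dims, 2))


# Hypothetical expression levels: gene2 tracks gene1 with an offset of 100
# on dimensions 0..3, but breaks the pattern on dimension 4.
gene1 = [10.0, 50.0, 30.0, 70.0, 20.0]
gene2 = [110.0, 150.0, 130.0, 170.0, 95.0]

dist = euclidean(gene1, gene2)
coherent_subset = is_coherent(gene1, gene2, dims=[0, 1, 2, 3], delta=1.0)
coherent_all = is_coherent(gene1, gene2, dims=[0, 1, 2, 3, 4], delta=1.0)
```

Here the two objects are far apart by Euclidean distance yet coherent on the subset of dimensions 0 to 3, and no longer coherent once dimension 4 is included, which is the situation the abstract describes for genes whose magnitudes differ but whose patterns are alike.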

References (18)
George M. Church, Yizong Cheng. Biclustering of Expression Data. Intelligent Systems in Molecular Biology, vol. 8, pp. 93-103 (2000).
Ryszard S. Michalski, Robert E. Stepp. Learning from Observation: Conceptual Clustering. Machine Learning, pp. 331-363 (1983). DOI: 10.1007/978-3-662-12405-5_11
Raymond T. Ng, Jiawei Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Very Large Data Bases, pp. 144-155 (1994).
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft. When Is "Nearest Neighbor" Meaningful? International Conference on Database Theory, pp. 217-235 (1999). DOI: 10.1007/3-540-49257-7_15
Hans-Peter Kriegel, Martin Ester, Jörg Sander, Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Knowledge Discovery and Data Mining, pp. 226-231 (1996).
Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition (2nd ed.). Academic Press Professional, Inc. (1990).
H. V. Jagadish, Raymond T. Ng, J. Madar. Semantic Compression and Pattern Extraction with Fascicles. Very Large Data Bases, pp. 186-198 (1999). DOI: 10.14288/1.0051612
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD '98), vol. 27, pp. 94-105 (1998). DOI: 10.1145/276304.276314
Chun-Hung Cheng, Ada Waichee Fu, Yi Zhang. Entropy-Based Subspace Clustering for Mining Numerical Data. Knowledge Discovery and Data Mining, pp. 84-93 (1999). DOI: 10.1145/312129.312199
Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, Jong Soo Park. Fast Algorithms for Projected Clustering. International Conference on Management of Data, vol. 28, pp. 61-72 (1999). DOI: 10.1145/304181.304188