Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis

Authors: Pablo A. Jaskowiak, Ricardo J.G.B. Campello, Ivan G. Costa

DOI: 10.1109/TCBB.2013.9

Keywords: Microarray analysis techniques; Mathematics; Data mining; Spearman's rank correlation coefficient; Benchmark (computing); Cluster analysis; Gene expression profiling; Similarity (network science); Euclidean distance; Pearson product-moment correlation coefficient

Abstract: Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, up to date, there are no comprehensive guidelines concerning how to choose proximity measures for clustering microarray data. Pearson is the most used proximity measure, whereas the characteristics of other ones remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures in 52 data sets from time course and cancer experiments. Our results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out for time course and cancer data evaluations, their choice should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
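The three commonly employed measures named in the abstract (Pearson, Spearman, and Euclidean distance) can behave quite differently on the same pair of expression profiles. The following is a minimal sketch in plain Python, not the authors' implementation; the function names and the toy profiles `gene_a`, `gene_b`, and `gene_c` are illustrative assumptions, and the other measures evaluated in the paper are not covered.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical toy expression profiles (not taken from the paper's data sets).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, ten times the magnitude
gene_c = [1.0, 4.0, 9.0, 16.0]     # monotone in gene_a, but nonlinear

print("Pearson   a-b:", pearson(gene_a, gene_b))
print("Euclidean a-b:", euclidean(gene_a, gene_b))
print("Pearson   a-c:", pearson(gene_a, gene_c))
print("Spearman  a-c:", spearman(gene_a, gene_c))
```

In this toy case, the correlation measures treat `gene_a` and `gene_b` as essentially identical in shape while Euclidean distance places them far apart, and Spearman treats the monotone pair `gene_a` and `gene_c` as perfectly associated while Pearson does not. A correlation `r` is often turned into a dissimilarity as `1 - r` before clustering; whether the paper uses exactly that transformation is not stated in the abstract.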
