Why Classification Models Using Array Gene Expression Data Perform So Well: A Preliminary Investigation of Explanatory Factors.

Authors: Constantin F. Aliferis, Alexander R. Statnikov, Pierre P. Massion, Ioannis Tsamardinos, Douglas P. Hardin

DOI:

Keywords:

Abstract: Results in the literature of classification models from microarray data often appear to be exceedingly good relative to most other domains of machine learning and clinical diagnostics. Yet array data are noisy and have very small sample-to-variable ratios. What is the explanation for such exemplary, yet counterintuitive, classification performance? Answering this question has significant implications (a) for the broad acceptance of such models by the medical and biostatistical community, and (b) for gaining valuable insight on the properties of this domain. To address this problem we build several models for three classification tasks in a gene expression dataset with 12,600 oligonucleotides and 203 patient cases. We then study the effects of: classifier type (kernel-based/non-kernel-based, linear/non-linear), sample size, gene selection within cross-validation, and information redundancy. Our analyses show that redundancy and classifier choice have the strongest effects on performance. Linear bias in the classifiers and sample size (as long as kernel classifiers are used) have relatively small effects; train-test ratio and cross-validation method have small-to-negligible effects.
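One of the factors listed in the abstract, gene selection within cross-validation, can be illustrated with a small sketch. This is not the authors' implementation: the data are synthetic pure noise, and the mean-difference gene ranking, the nearest-centroid classifier, and all function names (`top_k_features`, `centroid_predict`, `cv_accuracy`) are hypothetical stand-ins chosen only to keep the example self-contained. It contrasts selecting genes once on all samples (test folds included) with re-selecting them inside each training fold.

```python
import random


def top_k_features(X, y, k):
    """Rank features by absolute difference of class means; return the k best indices."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def centroid_predict(X_train, y_train, x, features):
    """Nearest-centroid prediction restricted to the chosen features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        dist = 0.0
        for j in features:
            centroid = sum(r[j] for r in rows) / len(rows)
            dist += (x[j] - centroid) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_accuracy(X, y, k, n_folds, select_inside):
    """k-fold cross-validated accuracy with gene selection inside or outside the folds."""
    n = len(X)
    # Selection on ALL samples: the test folds leak into the choice of genes.
    global_features = top_k_features(X, y, k)
    correct = 0
    for f in range(n_folds):
        test = [i for i in range(n) if i % n_folds == f]
        train = [i for i in range(n) if i % n_folds != f]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Selection inside the fold sees only the training samples.
        features = top_k_features(X_tr, y_tr, k) if select_inside else global_features
        for i in test:
            correct += centroid_predict(X_tr, y_tr, X[i], features) == y[i]
    return correct / n


# Synthetic assumption: 40 samples, 500 noise "genes", labels unrelated to the features.
rng = random.Random(0)
n_samples, n_genes = 40, 500
X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]

outside = cv_accuracy(X, y, k=10, n_folds=5, select_inside=False)
inside = cv_accuracy(X, y, k=10, n_folds=5, select_inside=True)
print(f"selection outside CV: {outside:.2f}")
print(f"selection inside CV:  {inside:.2f}")
```

Because the labels here carry no signal, an honest estimate should sit near chance; selecting genes on all samples before cross-validation typically inflates the estimate, which is why the two protocols are worth distinguishing.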
