Authors: Constantin F. Aliferis, Alexander R. Statnikov, Pierre P. Massion, Ioannis Tsamardinos, Douglas P. Hardin
DOI:
Keywords:
Abstract: Results in the literature of classification models from microarray data often appear to be exceedingly good relative to most other domains of machine learning and clinical diagnostics. Yet array data are noisy and have very small sample-to-variable ratios. What is the explanation for such exemplary, yet counterintuitive, performance? Answering this question has significant implications for (a) the broad acceptance of such models by the medical and biostatistical community, and (b) gaining valuable insight on the properties of this domain. To address this problem we build several models for three classification tasks in a gene expression dataset with 12,600 oligonucleotides and 203 patient cases. We then study the effects of: classifier type (kernel-based/non-kernel-based, linear/non-linear), sample size, selection within cross-validation, and information redundancy. Our analyses show that information redundancy and the choice of classifier have the strongest effects on performance. Linear bias of the classifiers and sample size (as long as kernel classifiers are used) have relatively small effects; train-test ratio and cross-validation method have small-to-negligible effects.