A New Fixed-Overlap Partitioning Algorithm for Determining Stability of Bioinformatics Gene Rankers

Authors: Randall Wald, Taghi Khoshgoftaar, David Dittman

DOI: 10.1109/ICMLA.2012.149

Keywords: DNA microarray, Deviance (statistics), Feature selection, Data mining, Computer science, Gene, Robustness (computer science), Bioinformatics, Receiver operating characteristic, Algorithm

Abstract: Feature (gene) selection has become an important and necessary step for combating high dimensionality, a problem found in bioinformatics datasets. Many studies have focused on gene selection, examining both the design of these techniques and the classification performance of prediction models built using these techniques. However, it is only recently that any work has focused on the robustness or stability of these techniques. Robustness is important because techniques which do not give reliable gene lists cannot be trusted to find useful genes. Previous papers studying stability typically generate multiple random subsamples of the original dataset and compare the genes chosen from these subsamples with one another, or directly with the genes chosen from the original data. These methods have known problems, either comparing two randomly generated datasets with an unknown level of overlap or comparing datasets of different sizes. This paper introduces a new algorithm for generating subsample datasets, called fixed-overlap partitions. The algorithm generates datasets that share exactly the desired number of instances. Using this method we evaluate nineteen feature selection techniques on twenty-six real-world DNA microarray datasets. Our results show that there are three rankers (Deviance, Receiver Operating Characteristic curve, and Precision-Recall Curve) which are consistently the most stable. The level of overlap, the quality of the data, and the number of genes selected all have an effect on which ranker is most stable in a given situation. The fixed-overlap partitions in particular are able to find how datasets with varying levels of difficulty can sometimes resemble one another (for example, moderate-difficulty datasets behave like easy-difficulty datasets at low overlap but diverge as overlap increases).
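
The abstract does not spell out the partitioning procedure, so the following is only a minimal sketch of the general idea: draw two subsamples of equal size that share an exact, chosen number of instances. The function name `fixed_overlap_partitions`, its parameters, and the simple unstratified shuffling are assumptions made for illustration, not the authors' published algorithm.

```python
import random


def fixed_overlap_partitions(n_instances, subset_size, n_overlap, seed=None):
    """Return two index lists of length subset_size sharing exactly n_overlap indices.

    Hypothetical helper: an illustration of the fixed-overlap idea,
    not the procedure from the paper.
    """
    if not 0 <= n_overlap <= subset_size:
        raise ValueError("n_overlap must be between 0 and subset_size")
    unique_per_subset = subset_size - n_overlap
    needed = n_overlap + 2 * unique_per_subset
    if needed > n_instances:
        raise ValueError("dataset too small for the requested size and overlap")

    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)

    # Slice the shuffled indices into three disjoint groups.
    shared = indices[:n_overlap]
    only_a = indices[n_overlap:n_overlap + unique_per_subset]
    only_b = indices[n_overlap + unique_per_subset:needed]

    return sorted(shared + only_a), sorted(shared + only_b)


subset_a, subset_b = fixed_overlap_partitions(100, 40, 10, seed=0)
print(len(subset_a), len(subset_b), len(set(subset_a) & set(subset_b)))  # 40 40 10
```

Because both subsamples have the same size and a known overlap, gene lists produced by a ranker on each can be compared without the two problems the abstract mentions (unknown overlap, unequal sizes). A practical version would probably also preserve class proportions in each subsample; that is omitted here.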
