Authors: Randall Wald, Taghi Khoshgoftaar, David Dittman
Keywords: DNA microarray, Deviance (statistics), Feature selection, Data mining, Computer science, Gene, Robustness (computer science), Bioinformatics, Receiver operating characteristic, Algorithm
Abstract: Feature (gene) selection has become an important and necessary step for combating high dimensionality, a problem found in bioinformatics datasets. Many studies have focused on gene selection, examining both the design of these techniques and the classification performance of prediction models built using these techniques. However, it is only recently that any work has focused on the robustness, or stability, of these techniques. Robustness is important because techniques which do not give reliable gene lists cannot be trusted to find useful genes. Previous papers studying stability typically generate multiple random subsamples of the original dataset and compare the genes chosen from these subsamples with one another, or directly with the genes chosen from the original data. These methods have known problems: they compare either two randomly generated datasets with an unknown level of overlap, or datasets of different sizes. This paper introduces a new algorithm for generating subsamples, called fixed-overlap partitions. It yields subsamples that share exactly the desired number of instances. Using this method, we evaluate nineteen feature selection techniques on twenty-six real-world DNA microarray datasets. Our results show that there are three rankers (Deviance, Receiver Operating Characteristic curve, and Precision-Recall Curve) which are consistently the most stable. The level of overlap, the quality of the data, and the number of features selected all have an effect on which ranker is most stable in a given situation. The fixed-overlap partitions in particular are able to find how varying levels of overlap can cause datasets of different difficulty to sometimes resemble one another (for example, moderate-difficulty datasets behave like easy-difficulty datasets at low overlap but diverge as overlap increases).
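The abstract does not spell out how fixed-overlap partitions are built, so the following is only a minimal sketch of the general idea: draw two equal-sized subsamples of instance indices that share exactly a requested number of instances. The function name `fixed_overlap_pair` and its parameters (`n_instances`, `subsample_size`, `overlap`, `seed`) are hypothetical and are not taken from the paper; the authors' actual algorithm may differ.

```python
import random


def fixed_overlap_pair(n_instances, subsample_size, overlap, seed=None):
    """Return two sorted index lists of equal size that share exactly
    `overlap` indices. Illustrative only; not the paper's algorithm."""
    if not 0 <= overlap <= subsample_size:
        raise ValueError("overlap must be between 0 and subsample_size")
    # Distinct instances required: the shared ones plus two disjoint remainders.
    needed = 2 * subsample_size - overlap
    if needed > n_instances:
        raise ValueError("dataset too small for this size/overlap combination")

    rng = random.Random(seed)
    # Draw all required instances at once, without replacement.
    chosen = rng.sample(range(n_instances), needed)

    unique_each = subsample_size - overlap
    shared = chosen[:overlap]
    first_only = chosen[overlap:overlap + unique_each]
    second_only = chosen[overlap + unique_each:]

    return sorted(shared + first_only), sorted(shared + second_only)


if __name__ == "__main__":
    a, b = fixed_overlap_pair(n_instances=100, subsample_size=40, overlap=10, seed=0)
    print(len(a), len(b), len(set(a) & set(b)))
```

Under these assumptions, a feature ranker could be run on each of the two subsamples and the resulting gene lists compared, with the overlap between the subsamples known in advance and both subsamples having the same size.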