Impact of Data Sampling on Stability of Feature Selection for Software Measurement Data

Authors: Kehan Gao, Taghi M. Khoshgoftaar, Amri Napolitano

DOI: 10.1109/ICTAI.2011.172

Abstract: Software defect prediction can be considered a binary classification problem. Generally, practitioners utilize historical software data, including metric and fault data collected during the software development process, to build a classification model and then employ this model to predict new program modules as either fault-prone (fp) or not-fault-prone (nfp). Limited project resources can then be allocated according to the prediction results by (for example) assigning more reviews and testing to the modules predicted to be potentially defective. Two challenges often come with the modeling process: (1) high dimensionality of software measurement data and (2) skewed or imbalanced distributions between the two types of modules (fp and nfp) in those datasets. To overcome these problems, extensive studies have been dedicated towards improving the quality of training data. The commonly used techniques are feature selection and data sampling. Usually, researchers focus on evaluating classification performance after the training data is modified. The present study assesses a feature selection technique from a different perspective. We are interested in studying the stability of a feature selection method, especially in understanding the impact of data sampling on the stability of feature selection when using the sampled data. Some interesting findings were obtained from case studies performed on datasets from real-world software projects.
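The idea of measuring feature selection stability under data sampling can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the sampling technique, the feature ranker, or the stability metric, so everything below is an assumption chosen for brevity. The sketch uses random undersampling of the nfp class, a toy filter that ranks features by the absolute difference of class means, and average pairwise Jaccard similarity between the selected feature subsets. All function names and the synthetic data generator are hypothetical.

```python
import random
from itertools import combinations


def make_data(n_fp, n_nfp, n_features, seed=1):
    """Synthetic imbalanced data (assumption): features 0 and 1 are shifted
    for fp modules (label 1); all other features are pure noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, count in ((1, n_fp), (0, n_nfp)):
        for _ in range(count):
            row = [rng.gauss(2.0 * label if j < 2 else 0.0, 1.0)
                   for j in range(n_features)]
            rows.append(row)
            labels.append(label)
    return rows, labels


def undersample(rows, labels, rng):
    """Random undersampling: keep all fp rows and an equal-sized random
    subset of nfp rows. Assumes fp (label 1) is the minority class."""
    fp = [i for i, y in enumerate(labels) if y == 1]
    nfp = [i for i, y in enumerate(labels) if y == 0]
    keep = fp + rng.sample(nfp, len(fp))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def top_k_features(rows, labels, k):
    """Toy filter ranker (assumption): score each feature by the absolute
    difference between its fp mean and its nfp mean, keep the top k."""
    scores = []
    for j in range(len(rows[0])):
        fp_vals = [r[j] for r, y in zip(rows, labels) if y == 1]
        nfp_vals = [r[j] for r, y in zip(rows, labels) if y == 0]
        diff = abs(sum(fp_vals) / len(fp_vals) - sum(nfp_vals) / len(nfp_vals))
        scores.append((diff, j))
    scores.sort(reverse=True)
    return frozenset(j for _, j in scores[:k])


def jaccard(a, b):
    """Jaccard similarity between two feature subsets."""
    return len(a & b) / len(a | b)


def stability(rows, labels, k, runs, seed=0):
    """Average pairwise Jaccard similarity of the top-k feature subsets
    selected on `runs` independently undersampled copies of the data."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(runs):
        s_rows, s_labels = undersample(rows, labels, rng)
        subsets.append(top_k_features(s_rows, s_labels, k))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    rows, labels = make_data(n_fp=20, n_nfp=200, n_features=10)
    print("stability:", stability(rows, labels, k=3, runs=10))
```

A score near 1 means the ranker picks almost the same features regardless of which nfp modules survive sampling; a score near 0 means the selected subset depends heavily on the sample. Other stability measures or sampling schemes could be swapped in without changing the overall structure.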
