Feature subset selection bias for classification learning

Authors: Surendra K. Singhi, Huan Liu

DOI: 10.1145/1143844.1143951

Abstract: Feature selection is often applied to high-dimensional data prior to classification learning. Using the same training dataset in both selection and learning can result in so-called feature subset selection bias. This bias putatively can exacerbate data over-fitting and negatively affect classification performance. However, in current practice separate datasets are seldom employed for selection and learning, because dividing the training data into two datasets for feature selection and classifier learning respectively reduces the amount of data that can be used in either task. This work attempts to address this dilemma. We formalize selection bias for classification learning, analyze its statistical properties, and study factors that affect selection bias, as well as how the bias impacts classification learning via various experiments. This research endeavors to provide an illustration and explanation of why the bias may not cause negative impact in classification as much as expected in regression.
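
The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' formalization or experimental setup. It assumes pure Gaussian noise features, a simple filter that ranks features by the absolute difference between class means, and hypothetical names (`class_mean_gap`, `make_noise_data`, `selection_bias_demo`). Because no feature is truly informative, the true class-mean gap is zero for every feature, so any gap measured on the selected features is an artifact. The gap measured on the same data used for selection should come out inflated compared with the gap measured on a separate dataset.

```python
import random


def class_mean_gap(rows, labels, j):
    """Absolute difference between the two class means of feature j."""
    pos = [r[j] for r, y in zip(rows, labels) if y == 1]
    neg = [r[j] for r, y in zip(rows, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def make_noise_data(rng, n_samples, n_features):
    """Gaussian noise features with balanced binary labels.

    Assumption for illustration: no feature carries any class signal.
    """
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_samples)]
    labels = [i % 2 for i in range(n_samples)]
    return rows, labels


def selection_bias_demo(seed=0, n_samples=40, n_features=500, k=10):
    """Return (gap on the selection data, gap on separate data) for the top-k features."""
    rng = random.Random(seed)
    sel_rows, sel_labels = make_noise_data(rng, n_samples, n_features)
    new_rows, new_labels = make_noise_data(rng, n_samples, n_features)

    # Rank features by their class-mean gap on the selection dataset only.
    ranked = sorted(
        range(n_features),
        key=lambda j: class_mean_gap(sel_rows, sel_labels, j),
        reverse=True,
    )
    chosen = ranked[:k]

    # Estimate the gap of the chosen features on the same data and on fresh data.
    same = sum(class_mean_gap(sel_rows, sel_labels, j) for j in chosen) / k
    separate = sum(class_mean_gap(new_rows, new_labels, j) for j in chosen) / k
    return same, separate


if __name__ == "__main__":
    same, separate = selection_bias_demo()
    print(f"mean gap on selection data: {same:.3f}")
    print(f"mean gap on separate data:  {separate:.3f}")
```

The sketch also shows the cost of the remedy that the abstract mentions: the separate dataset used for the second estimate is data that could otherwise have been used for selection or for learning.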
