A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning

Authors: Salvador Garcia, J. Luengo, José Antonio Sáez, Victoria López, F. Herrera

DOI: 10.1109/TKDE.2012.35

Keywords: Machine learning, Categorization, Computer science, Knowledge extraction, Artificial intelligence, Taxonomy (general), Data pre-processing, Data set, Data mining, Decision tree, Set (abstract data type), Categorical variable, Supervised learning, Discretization

Abstract: Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones, by associating categorical values to intervals and thus transforming quantitative data into qualitative data. In this manner, symbolic data mining algorithms can be applied over continuous data and the representation of information is simplified, making it more concise and specific. The literature provides numerous proposals of discretization, and some attempts to categorize them into a taxonomy can be found. However, in previous papers, there is a lack of consensus in the definition of the properties, and no formal categorization has been established yet, which may be confusing for practitioners. Furthermore, only a small set of discretizers have been widely considered, while many other methods have gone unnoticed. With the intention of alleviating these problems, this paper provides a survey of discretization methods proposed in the literature from a theoretical and empirical perspective. From the theoretical perspective, we develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all the known methods up to date. Empirically, we conduct an experimental study in supervised classification involving the most representative and newest discretizers, different types of classifiers, and a large number of data sets. The results of their performances, measured in terms of accuracy, number of intervals, and inconsistency, have been verified by means of nonparametric statistical tests. Additionally, a set of discretizers are highlighted as the best performing ones.
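
The core idea in the abstract, mapping continuous values to a small number of intervals, can be illustrated with a minimal sketch. It assumes equal-width binning, one of the simplest unsupervised schemes, chosen here only for illustration; it is not one of the methods the paper highlights as best performing. The function names and the sample data are hypothetical.

```python
from bisect import bisect_right


def equal_width_cut_points(values, n_bins):
    """Return the n_bins - 1 interior cut points that split [min, max] evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(values, cut_points):
    """Map each continuous value to the index of the interval it falls in."""
    # bisect_right counts how many cut points are <= v, which is the interval index.
    return [bisect_right(cut_points, v) for v in values]


# Hypothetical continuous attribute (e.g. ages) split into 3 intervals.
ages = [18.0, 22.5, 31.0, 47.0, 52.5, 78.0]
cuts = equal_width_cut_points(ages, n_bins=3)
labels = discretize(ages, cuts)
print(cuts)
print(labels)
```

Each returned label is an interval index that a symbolic learner can treat as a categorical value. The number of intervals is fixed in advance here, whereas supervised discretizers typically choose cut points using the class labels.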
