Error-based and entropy-based discretization of continuous features

Authors: Ron Kohavi, Mehran Sahami

Keywords: Data mining; Discretization; Decision tree learning; Entropy (classical thermodynamics); Entropy (statistical thermodynamics); Entropy (information theory); Decision tree; Computer science; Discretization error; Algorithm; Entropy (arrow of time); Discretization of continuous features; Computational complexity theory; Entropy (order and disorder); Entropy (energy dispersal)

Abstract: We present a comparison of error-based and entropy-based methods for discretization of continuous features. Our study includes both an extensive empirical comparison as well as an analysis of scenarios where error minimization may be an inappropriate discretization criterion. We present a discretization method based on the C4.5 decision tree algorithm and compare it to an existing entropy-based discretization algorithm, which employs the Minimum Description Length Principle, and a recently proposed error-based technique. We evaluate these discretization methods with respect to C4.5 and Naive-Bayesian classifiers on datasets from the UCI repository and analyze the computational complexity of each method. Our results indicate that the entropy-based MDL heuristic outperforms error minimization on average. We then analyze the shortcomings of error-based approaches in comparison to entropy-based methods.
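To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of choosing a single binary cut point on one continuous feature under the two criteria. The function names (`entropy_cost`, `error_cost`, `best_cut`) and the toy data are illustrative assumptions; the methods compared in the paper are more elaborate (for example, the entropy-based method uses an MDL criterion, which is omitted here).

```python
import math
from collections import Counter

def entropy_cost(labels):
    """Size-weighted class entropy (in bits) of one interval."""
    n = len(labels)
    h = -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
    return n * h

def error_cost(labels):
    """Training errors made by predicting the majority class in one interval."""
    return len(labels) - max(Counter(labels).values())

def best_cut(values, labels, cost):
    """Return (threshold, total_cost) for the binary cut that most reduces
    cost, or None if no cut is strictly better than leaving the feature whole."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    best, best_cost = None, cost(ys)
    for i in range(1, len(xs)):
        if xs[i - 1] == xs[i]:
            continue  # cannot cut between identical feature values
        total = cost(ys[:i]) + cost(ys[i:])
        if total < best_cost - 1e-12:
            best_cost = total
            best = ((xs[i - 1] + xs[i]) / 2, total)
    return best

# Hypothetical toy data: class "a" is the majority on both sides of every cut.
values = list(range(1, 11))
labels = list("aaaaaababa")

print("error-based cut:  ", best_cut(values, labels, error_cost))
print("entropy-based cut:", best_cut(values, labels, entropy_cost))
```

On this toy data no cut lowers the number of training errors, because the majority class is the same in both resulting intervals, so the error-based criterion leaves the feature unsplit. The entropy criterion still separates the pure region from the mixed one. This is one simple scenario of the kind the abstract alludes to, offered as an illustration only.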

References (13)
Peter Auer, Robert C. Holte, Wolfgang Maass. Theory and Applications of Agnostic PAC-Learning with Small Decision Trees. Machine Learning Proceedings 1995, pp. 21-29 (1995). doi:10.1016/B978-1-55860-377-6.50012-8
Se June Hong, Chidanand Apte. Predicting equity returns from securities data. Knowledge Discovery and Data Mining, pp. 541-560 (1996).
Keki B. Irani, Usama M. Fayyad. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. International Joint Conference on Artificial Intelligence, vol. 2, pp. 1022-1027 (1993).
J. Catlett. On changing continuous attributes into ordered discrete attributes. Lecture Notes in Computer Science, pp. 164-178 (1991). doi:10.1007/BFB0017012
George H. John, Ron Kohavi, Karl Pfleger. Irrelevant Features and the Subset Selection Problem. Machine Learning Proceedings 1994, pp. 121-129 (1994). doi:10.1016/B978-1-55860-335-6.50023-4
James Dougherty, Ron Kohavi, Mehran Sahami. Supervised and Unsupervised Discretization of Continuous Features. Machine Learning Proceedings 1995, pp. 194-202 (1995). doi:10.1016/B978-1-55860-377-6.50032-3
Scott Cost, Steven Salzberg. A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine Learning, vol. 10, pp. 57-78 (1993). doi:10.1023/A:1022664626993
Wolfgang Maass. Efficient agnostic PAC-learning with simple hypothesis. Proceedings of the Seventh Annual Conference on Computational Learning Theory (COLT '94), pp. 67-75 (1994). doi:10.1145/180139.181016
R. Kohavi, G. John, R. Long, D. Manley, K. Pfleger. MLC++: a machine learning library in C++. International Conference on Tools with Artificial Intelligence, pp. 740-743 (1994). doi:10.1109/TAI.1994.346412