Discretisation of Continuous Commercial Database Features for a Simulated Annealing Data Mining Algorithm

Authors: Justin C.W. Debuse, Victor J. Rayward-Smith

DOI: 10.1023/A:1008339026836

Keywords: Database, Simulated annealing, Computer science, Data mining algorithm, Overfitting, Data mining, Discretization

Abstract: An introduction to the approaches used to discretise continuous database features is given, together with a discussion of the potential benefits of such techniques. These benefits are investigated by applying discretisation algorithms to two large commercial databases; the discretisations yielded are then evaluated using a simulated annealing based data mining algorithm. The results produced suggest that dramatic reductions in problem size may be achieved, yielding improvements in the speed of the data mining algorithm. However, it is also demonstrated that under certain circumstances the discretisation produced may give an increase in problem size or allow overfitting by the data mining algorithm. Such cases, within which often only a small proportion of the database belongs to the class of interest, highlight the need both for caution when producing discretisations and for the development of more robust discretisation algorithms.
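
As a rough illustration of how discretisation can shrink problem size, the sketch below applies simple equal-width binning to a continuous feature. This is a generic unsupervised technique chosen for brevity and is not claimed to be one of the algorithms evaluated in the paper; the function names, the sample values, and the choice of three bins are all assumptions made for the example.

```python
from bisect import bisect_right

def equal_width_cut_points(values, n_bins):
    """Return the n_bins - 1 interior cut points that split the
    observed range of `values` into intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def discretise(values, cut_points):
    """Map each continuous value to the index of the interval it falls in."""
    return [bisect_right(cut_points, v) for v in values]

# Hypothetical continuous feature with ten distinct values.
ages = [18.5, 22.0, 25.3, 31.7, 38.2, 44.9, 52.1, 60.0, 67.4, 73.8]

cuts = equal_width_cut_points(ages, 3)
bins = discretise(ages, cuts)

# The ten distinct raw values collapse to three interval labels, so a
# rule search over this feature has far fewer candidate bounds to consider.
print(len(set(ages)), "distinct values before,", len(set(bins)), "after")
```

Equal-width binning ignores the class label entirely, so when only a small proportion of records belongs to the class of interest, the cut points may fall in unhelpful places; supervised methods place the cut points using the class distribution instead.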
