Scorecard construction with unbalanced class sizes

Authors: Veronica Vinciotti, David J. Hand

Abstract: A long-running issue in scorecard construction is how to handle dramatically unbalanced class sizes. This is important because, in many applications, the class sizes are very different. For example, it is common to find that 'bad' customers constitute less than 10% of the customer base, and even more extreme situations often arise: Brause et al (1999) remark that in their database of credit card transactions ‘the probability of fraud is very low (0.2%) and has been lowered in a preprocessing step by a conventional fraud detecting system down to 0.1%,’ while Hassibi (2000) comments ‘out of some 12 billion transactions made annually, approximately 10 million – or one out of every 1200 transactions – turn out to be fraudulent. Also, 0.04% (4 out of every 10,000) of all monthly active accounts are fraudulent.’ In coping with unbalanced classes, there are two issues to be considered. Firstly, what performance criterion is appropriate? And, secondly, how should scorecards be constructed, and any parameters estimated, from such data? We look at each of these problems. For the first problem, we illustrate the effect of a marked lack of balance on performance criteria, demonstrating how easy it is to be misled. The lack of balance means that simple error counts are inappropriate as performance criteria. Rather, misclassifications of the smaller class must be regarded as more serious than the converse: different costs must be adopted for the different kinds of misclassification. We examine the implications of this. For the second, we examine both classical linear scorecards and powerful k-nearest-neighbour nonparametric methods, which have been used in fraud detection. In the linear case (and, more generally, for any parametric form), improved classification accuracy can be achieved by focusing on particular parts of the data space, the relevant parts being implied by the relative misclassification costs. We describe a new tool for constructing scorecards which takes this fact into account. For k-nearest-neighbour methods, we draw attention to a phenomenon which we believe has not been previously reported, and which has an important bearing on the choice of k. We illustrate the methods using a large set of unsecured personal loan data.
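
The abstract's point about error counts can be illustrated with a small worked example. The sketch below is not taken from the paper: the population size, the two classification rules, and the cost values are all hypothetical, chosen only to show how a raw error rate and a cost-weighted criterion can rank the same two rules in opposite order when the 'bad' class is rare (0.1% here, in line with the fraud rates quoted above).

```python
def error_rate(n_total, missed_bad, false_alarms):
    """Plain misclassification rate: every error counts the same."""
    return (missed_bad + false_alarms) / n_total


def average_cost(n_total, missed_bad, false_alarms, cost_missed_bad, cost_false_alarm):
    """Cost-weighted criterion: each kind of error carries its own cost."""
    total = missed_bad * cost_missed_bad + false_alarms * cost_false_alarm
    return total / n_total


# Hypothetical population with 0.1% 'bad' cases (assumed figures).
N_GOOD, N_BAD = 99_900, 100
N_TOTAL = N_GOOD + N_BAD

# Rule A (hypothetical): call everyone 'good'. Misses all 100 bad cases.
rule_a = {"missed_bad": 100, "false_alarms": 0}
# Rule B (hypothetical): catches 80 bad cases but flags 500 good ones.
rule_b = {"missed_bad": 20, "false_alarms": 500}

# Assumed costs: missing a bad case is 50 times worse than a false alarm.
COST_MISSED_BAD = 50.0
COST_FALSE_ALARM = 1.0

err_a = error_rate(N_TOTAL, **rule_a)
err_b = error_rate(N_TOTAL, **rule_b)
cost_a = average_cost(N_TOTAL, **rule_a,
                      cost_missed_bad=COST_MISSED_BAD,
                      cost_false_alarm=COST_FALSE_ALARM)
cost_b = average_cost(N_TOTAL, **rule_b,
                      cost_missed_bad=COST_MISSED_BAD,
                      cost_false_alarm=COST_FALSE_ALARM)

print(f"error rate:   A={err_a:.4f}  B={err_b:.4f}")
print(f"average cost: A={cost_a:.4f}  B={cost_b:.4f}")
```

Under these assumed numbers the trivial rule A has the lower error rate, yet rule B has the lower average cost, which is the sense in which simple error counts can mislead. The ranking depends entirely on the cost ratio chosen; a different ratio could reverse it.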
