On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms

Authors: Kenji Yamanishi, Jun-Ichi Takeuchi, Graham Williams, Peter Milne

DOI: 10.1145/347090.347160

Keywords: One-class classification, Mixture model, Outlier, Categorical variable, Artificial intelligence, Statistical learning theory, Anomaly detection, Data mining, Machine learning, Statistical model, Intrusion detection system, Unsupervised learning, Algorithm, Computer science

Abstract: Outlier detection is a fundamental issue in data mining, specifically in fraud detection, network intrusion detection, network monitoring, etc. SmartSifter is an outlier detection engine addressing this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (using a finite mixture model) of the information source. Each time a datum is input, SmartSifter employs an on-line discounting learning algorithm to learn the probabilistic model. A score is given to the datum based on the learned model, with a high score indicating a high possibility of being a statistical outlier. The novel features of SmartSifter are: (1) it is adaptive to non-stationary sources of data; (2) a score has a clear statistical/information-theoretic meaning; (3) it is computationally inexpensive; and (4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs. A further experimental application has identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.
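
The score-then-update loop described in the abstract can be illustrated with a deliberately simplified sketch. This is not the SmartSifter algorithm: it assumes a single one-dimensional Gaussian instead of a finite mixture, uses the negative log-likelihood as the score, and forgets old data with an exponential discount. The class name `DiscountingGaussianScorer`, the parameter `discount`, and the sample stream are all hypothetical choices made for illustration.

```python
import math


class DiscountingGaussianScorer:
    """Illustrative only: one 1-D Gaussian with exponentially discounted
    statistics, standing in for the finite mixture used in the paper."""

    def __init__(self, discount=0.05, mean=0.0, var=1.0, min_var=1e-6):
        self.r = discount        # assumed discount rate; larger forgets faster
        self.mean = mean
        self.var = var
        self.min_var = min_var   # floor that keeps the variance positive

    def score(self, x):
        # Logarithmic loss: -log p(x) under the model learned so far.
        return (0.5 * math.log(2.0 * math.pi * self.var)
                + (x - self.mean) ** 2 / (2.0 * self.var))

    def update(self, x):
        # Exponentially weighted mean and variance: past data are
        # down-weighted by (1 - r) at every step.
        delta = x - self.mean
        self.mean += self.r * delta
        self.var = max((1.0 - self.r) * (self.var + self.r * delta * delta),
                       self.min_var)

    def process(self, x):
        # Score the datum against the current model, then learn from it.
        s = self.score(x)
        self.update(x)
        return s


# Hypothetical stream: values near 0 with one obvious outlier at position 5.
stream = [0.1, -0.2, 0.05, 0.3, -0.1, 8.0, 0.0]
scorer = DiscountingGaussianScorer()
scores = [scorer.process(x) for x in stream]
for x, s in zip(stream, scores):
    print(f"x={x:5.2f}  score={s:.3f}")
```

Scoring each datum before updating on it matters: the model has not yet seen the point, so an unusual value receives a high score instead of being partly absorbed into the estimate first.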

References (22)
Salvatore J. Stolfo, Philip K. Chan. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. Knowledge Discovery and Data Mining, pp. 164-168 (1998).
Peter Burge, John Shawe-Taylor. Detecting Cellular Fraud Using Adaptive Prototypes. National Conference on Artificial Intelligence (1997).
Graham J. Williams, Zhexue Huang. Mining the Knowledge Mine: The Hot Spots Methodology for Mining Large Real World Databases. Australian Joint Conference on Artificial Intelligence, pp. 340-348 (1997). DOI: 10.1007/3-540-63797-4_87
Raymond T. Ng, Edwin M. Knorr. Algorithms for Mining Distance-Based Outliers in Large Datasets. Very Large Data Bases, pp. 392-403 (1998).
G. J. McLachlan, D. Peel. Finite Mixture Models (2000).
Raymond T. Ng, Edwin M. Knorr. Finding Intensional Knowledge of Distance-Based Outliers. Very Large Data Bases, pp. 211-222 (1999).
J. S. Marron, M. P. Wand. Exact Mean Integrated Squared Error. Annals of Statistics, vol. 20, pp. 712-736 (1992). DOI: 10.1214/AOS/1176348653
R. Krichevsky, V. Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, vol. 27, pp. 199-207 (1981). DOI: 10.1109/TIT.1981.1056331
Tom Fawcett, Foster Provost. Activity monitoring: noticing interesting changes in behavior. Knowledge Discovery and Data Mining, pp. 53-62 (1999). DOI: 10.1145/312129.312195