Disk-Based Sampling for Outlier Detection in High Dimensional Data

Authors: Pei Sun, Timothy de Vries, Sanjay Chawla, Gia Vinh Anh Pham

DOI:

Keywords: Phase (waves), Data mining, Clustering high-dimensional data, Curse of dimensionality, Sampling (statistics), Computer science, Data set, Algorithm, Outlier, Set (abstract data type), Anomaly detection

Abstract: We propose an efficient sampling-based outlier detection method for large high-dimensional data. Our method consists of two phases. In the first phase, we combine a “sampling” strategy with a simple randomized partitioning technique to generate a candidate set of outliers. This phase requires one full data scan, and its running time has linear complexity with respect to the size and dimensionality of the data set. An additional data scan, which constitutes the second phase, extracts the actual outliers from the candidate set. The running time of this phase is O(CN), where C and N are the sizes of the candidate set and the data set, respectively. The major strengths of the proposed approach are that (1) no partitioning of the dimensions is required, thus making it particularly suitable for high-dimensional data, and (2) a small sampling set (0.5% of the original data set) can discover more than 99% of all the outliers identified by a brute-force approach. We present a detailed experimental evaluation of our method on real and synthetic data sets and compare it with another ...
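The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the sampling or partitioning rules, so the sketch assumes a simple stand-in. Phase 1 scores each point by its distance to the nearest member of a random sample and keeps the highest-scoring points as candidates (one scan, linear in N and in the dimensionality for a fixed sample size). Phase 2 rescans the data for each candidate and ranks candidates by the distance to their k-th nearest neighbour, which costs O(CN) distance computations. The function names and the parameters `sample_size`, `num_candidates`, `k`, and `top_n` are hypothetical.

```python
import heapq
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidate_phase(data, sample_size, num_candidates, rng):
    """Phase 1 (one data scan): return indices of candidate outliers.

    ASSUMPTION: a point is scored by its distance to the nearest sampled
    point (excluding itself). This is an illustrative rule, not the
    paper's partitioning technique. Requires sample_size >= 2.
    """
    sample_idx = rng.sample(range(len(data)), sample_size)
    scored = []
    for i, p in enumerate(data):
        nearest = min(dist(p, data[j]) for j in sample_idx if j != i)
        scored.append((nearest, i))
    return [i for _, i in heapq.nlargest(num_candidates, scored)]


def verification_phase(data, candidates, k, top_n):
    """Phase 2 (one more scan per candidate): O(C * N) distance computations.

    ASSUMPTION: the outlier score is the distance to the k-th nearest
    neighbour, a common distance-based definition.
    """
    scored = []
    for c in candidates:
        dists = (dist(data[c], q) for j, q in enumerate(data) if j != c)
        kth_distance = heapq.nsmallest(k, dists)[-1]
        scored.append((kth_distance, c))
    return [c for _, c in heapq.nlargest(top_n, scored)]


def detect_outliers(data, sample_size, num_candidates, k, top_n, seed=0):
    """Run both phases and return the indices of the top_n outliers."""
    rng = random.Random(seed)
    candidates = candidate_phase(data, sample_size, num_candidates, rng)
    return verification_phase(data, candidates, k, top_n)
```

Calling `detect_outliers(data, sample_size=10, num_candidates=10, k=3, top_n=2)` on a list of equal-length numeric tuples returns the indices of the two highest-ranked points. Because only the candidates are rescanned, the second phase stays cheap when the candidate set is small relative to the data set.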

References (4)
Fabrizio Angiulli, Clara Pizzuti. Fast Outlier Detection in High Dimensional Spaces. European Conference on Principles of Data Mining and Knowledge Discovery, pp. 15-26 (2002). DOI: 10.1007/3-540-45681-3_2
Raymond T. Ng, Edwin M. Knorr. Algorithms for Mining Distance-Based Outliers in Large Datasets. Very Large Data Bases, pp. 392-403 (1998).
Douglas M. Hawkins. Identification of Outliers (1980).
Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim. Efficient Algorithms for Mining Outliers from Large Data Sets. International Conference on Management of Data, vol. 29, pp. 427-438 (2000). DOI: 10.1145/335191.335437