Authors: Pei Sun, Timothy de Vries, Sanjay Chawla, Gia Vinh Anh Pham
DOI:
Keywords: Phase (waves), Data mining, Clustering high-dimensional data, Curse of dimensionality, Sampling (statistics), Computer science, Data set, Algorithm, Outlier, Set (abstract data type), Anomaly detection
Abstract: We propose an efficient sampling-based outlier detection method for large high-dimensional data. Our method consists of two phases. In the first phase, we combine a "sampling" strategy with a simple randomized partitioning technique to generate a candidate set of outliers. This phase requires one full data scan, and its running time has linear complexity with respect to the size and dimensionality of the data set. An additional data scan, which constitutes the second phase, extracts the actual outliers from the candidate set. The running time of this phase is O(CN), where C and N are the sizes of the candidate set and the data set, respectively. The major strengths of the proposed approach are that (1) no partitioning of the dimensions is required, thus making it particularly suitable for high-dimensional data, and (2) a small candidate set (0.5% of the original data set) can discover more than 99% of all the outliers identified by a brute-force approach. We present a detailed experimental evaluation of our approach on real and synthetic data sets and compare it with another