On speeding up the implementation of nearest neighbour search and classification

作者: Ivo Marinchev , Gennady Agre

DOI: 10.1145/2812428.2812464

关键词:

摘要: The paper presents practical approaches and techniques to speeding up implementations of nearest neighbour search/classification algorithm for high dimensional data and/or many training examples. Such settings often appear in the fields big mining. We apply a fast iterative form polar decomposition use computed matrix pre-select smaller number candidate classes query element. show that additional speed can be achieved when consists instances by subdividing them subclasses approximation some clustering resulting classification is used building matrix. Our pre-processing (depends linearly or near on examples dimensions) pre-selection steps classes) with any well-known indexing method as annulus method, kd-trees, metric trees, r-trees, cover etc limit process. Finally we introduce what name cluster index practice it extends applicability structures higher order complexity bigger datasets.

参考文章(22)
Stephen M. Omohundro, Efficient Algorithms with Neural Network Behavior. Complex Systems. ,vol. 1, ,(1987)
Levent Ertöz, Michael S. Steinbach, Vipin Kumar, Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data. siam international conference on data mining. pp. 47- 58 ,(2003)
Richard E. Ladner, Kevin C. Zatloukal, Mary Holland Johnson, Nearest neighbor search for data compression. Data Structures, Near Neighbor Searches, and Methodology. pp. 69- 86 ,(1999)
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek, Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data Advances in Knowledge Discovery and Data Mining. pp. 831- 838 ,(2009) , 10.1007/978-3-642-01307-2_86
Michael E. Houle, Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek, Can shared-neighbor distances defeat the curse of dimensionality? statistical and scientific database management. pp. 482- 500 ,(2010) , 10.1007/978-3-642-13818-8_34
Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim, What Is the Nearest Neighbor in High Dimensional Spaces very large data bases. pp. 506- 515 ,(2000)
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft, When Is ''Nearest Neighbor'' Meaningful? international conference on database theory. pp. 217- 235 ,(1999) , 10.1007/3-540-49257-7_15
Kristin P. Bennett, Usama Fayyad, Dan Geiger, Density-based indexing for approximate nearest-neighbor queries knowledge discovery and data mining. pp. 233- 243 ,(1999) , 10.1145/312129.312236
Michael E. Houle, Navigating massive data sets via local clustering knowledge discovery and data mining. pp. 547- 552 ,(2003) , 10.1145/956750.956817
Jerome H. Friedman, Jon Louis Bentley, Raphael Ari Finkel, An Algorithm for Finding Best Matches in Logarithmic Expected Time ACM Transactions on Mathematical Software. ,vol. 3, pp. 209- 226 ,(1977) , 10.1145/355744.355745