Finding near neighbors through cluster pruning

作者: Flavio Chierichetti , Alessandro Panconesi , Prabhakar Raghavan , Mauro Sozio , Alessandro Tiberi

DOI: 10.1145/1265530.1265545

关键词: Data pointSynthetic dataNearest neighbor searchData miningNearest-neighbor chain algorithmAlgorithmGenerative modelk-nearest neighbors algorithmCluster analysisFixed-radius near neighborsMathematics

摘要: Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications text image search,in finding similar objects matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing randomly choose subset of points to be leaders the remaining are partitioned by which leader closest. For query processing, find leader(s) closest point. We then seek nearest for point among only clusters leader(s). Recursion may used both search. Such schemes approximate that "almost as good" neighbors. How good these approximations how much do they save computation.Our contributions are: (1) quantify metrics allow us tradeoff between processing quality neighbors; (2) give rigorous theoretical analysis our schemes, under natural generative processes (generalizing Gaussian mixtures) points; (3) experiments on synthetic from such processes, well document corpus, confirming orders magnitude cost at modest compromises retrieved points. In particular, show p-spheres, state-of-the-art solution, outperformed scheme whether stored main or external memo.

参考文章(24)
Raghu Ramakrishnan, Jonathan Goldstein, Contrast Plots and P-Sphere Trees: Space vs. Time in Nearest Neighbour Searches very large data bases. pp. 429- 440 ,(2000)
Piotr Indyk, Aristides Gionis, Rajeev Motwani, Similarity Search in High Dimensions via Hashing very large data bases. pp. 518- 529 ,(1999)
Sergey Brin, Near Neighbor Search in Large Metric Spaces very large data bases. pp. 574- 584 ,(1995)
Florence Jessie MacWilliams, Neil James Alexander Sloane, The Theory of Error-Correcting Codes ,(1977)
George Karypis, Michael Steinbach, Vipin Kumar, A Comparison of Document Clustering Techniques ,(2000)
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft, When Is ''Nearest Neighbor'' Meaningful? international conference on database theory. pp. 217- 235 ,(1999) , 10.1007/3-540-49257-7_15
Kristin P. Bennett, Usama Fayyad, Dan Geiger, Density-based indexing for approximate nearest-neighbor queries knowledge discovery and data mining. pp. 233- 243 ,(1999) , 10.1145/312129.312236
Tolga Bozkaya, Meral Ozsoyoglu, Distance-based indexing for high-dimensional metric spaces international conference on management of data. ,vol. 26, pp. 357- 368 ,(1997) , 10.1145/253260.253345
Marshall Bern, Approximate closest-point queries in high dimensions Information Processing Letters. ,vol. 45, pp. 95- 99 ,(1993) , 10.1016/0020-0190(93)90222-U
Herbert Edelsbrunner, Algorithms in Combinatorial Geometry ,(1987)