Finding near neighbors through cluster pruning

作者： Flavio Chierichetti , Alessandro Panconesi , Prabhakar Raghavan , Mauro Sozio , Alessandro Tiberi

关键词: Data point 、 Synthetic data 、 Nearest neighbor search 、 Data mining 、 Nearest-neighbor chain algorithm 、 Algorithm 、 Generative model 、 k-nearest neighbors algorithm 、 Cluster analysis 、 Fixed-radius near neighbors 、 Mathematics

摘要: Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications text image search,in finding similar objects matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing randomly choose subset of points to be leaders the remaining are partitioned by which leader closest. For query processing, find leader(s) closest point. We then seek nearest for point among only clusters leader(s). Recursion may used both search. Such schemes approximate that "almost as good" neighbors. How good these approximations how much do they save computation.Our contributions are: (1) quantify metrics allow us tradeoff between processing quality neighbors; (2) give rigorous theoretical analysis our schemes, under natural generative processes (generalizing Gaussian mixtures) points; (3) experiments on synthetic from such processes, well document corpus, confirming orders magnitude cost at modest compromises retrieved points. In particular, show p-spheres, state-of-the-art solution, outperformed scheme whether stored main or external memo.

参考文章(24)

Raghu Ramakrishnan, Jonathan Goldstein, Contrast Plots and P-Sphere Trees: Space vs. Time in Nearest Neighbour Searches very large data bases. pp. 429- 440 ,(2000)

Piotr Indyk, Aristides Gionis, Rajeev Motwani, Similarity Search in High Dimensions via Hashing very large data bases. pp. 518- 529 ,(1999)

Sergey Brin, Near Neighbor Search in Large Metric Spaces very large data bases. pp. 574- 584 ,(1995)

Florence Jessie MacWilliams, Neil James Alexander Sloane, The Theory of Error-Correcting Codes ,(1977)

George Karypis, Michael Steinbach, Vipin Kumar, A Comparison of Document Clustering Techniques ,(2000)

Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft, When Is ''Nearest Neighbor'' Meaningful? international conference on database theory. pp. 217- 235 ,(1999) , 10.1007/3-540-49257-7_15

Kristin P. Bennett, Usama Fayyad, Dan Geiger, Density-based indexing for approximate nearest-neighbor queries knowledge discovery and data mining. pp. 233- 243 ,(1999) , 10.1145/312129.312236

Tolga Bozkaya, Meral Ozsoyoglu, Distance-based indexing for high-dimensional metric spaces international conference on management of data. ,vol. 26, pp. 357- 368 ,(1997) , 10.1145/253260.253345

Marshall Bern, Approximate closest-point queries in high dimensions Information Processing Letters. ,vol. 45, pp. 95- 99 ,(1993) , 10.1016/0020-0190(93)90222-U

10.

Herbert Edelsbrunner, Algorithms in Combinatorial Geometry ,(1987)

Finding near neighbors through cluster pruning

来源期刊

我的账户

Finding near neighbors through cluster pruning

来源期刊

相似文章 10

我的账户