作者: Flavio Chierichetti , Alessandro Panconesi , Prabhakar Raghavan , Mauro Sozio , Alessandro Tiberi
关键词: Data point 、 Synthetic data 、 Nearest neighbor search 、 Data mining 、 Nearest-neighbor chain algorithm 、 Algorithm 、 Generative model 、 k-nearest neighbors algorithm 、 Cluster analysis 、 Fixed-radius near neighbors 、 Mathematics
摘要: Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications text image search,in finding similar objects matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing randomly choose subset of points to be leaders the remaining are partitioned by which leader closest. For query processing, find leader(s) closest point. We then seek nearest for point among only clusters leader(s). Recursion may used both search. Such schemes approximate that "almost as good" neighbors. How good these approximations how much do they save computation.Our contributions are: (1) quantify metrics allow us tradeoff between processing quality neighbors; (2) give rigorous theoretical analysis our schemes, under natural generative processes (generalizing Gaussian mixtures) points; (3) experiments on synthetic from such processes, well document corpus, confirming orders magnitude cost at modest compromises retrieved points. In particular, show p-spheres, state-of-the-art solution, outperformed scheme whether stored main or external memo.