Scaling k-Nearest Neighbours Queries (The Right Way)

作者: Atoshum Cahsai , Nikos Ntarmos , Christos Anagnostopoulos , Peter Triantafillou , None

DOI: 10.1109/ICDCS.2017.267

关键词:

摘要: Recently parallel / distributed processing approaches have been proposed for k-Nearest Neighbours (kNN) queries over very large (multidimensional) datasets aiming to ensure scalability. However, this is typically achieved at the expense of efficiency. With paper we offer a novel approach that alleviates performance problems associated with state art methods. The essence our approach, which differentiates it from related research, rests on (i) adopting coordinator-based algorithm, instead those employed data-parallel executionengines (such as Hadoop/MapReduce or Spark), and (ii) way organize data, structure computation, index stored ensures only small number data items are retrieved underlying store, communicated network, processed by coordinatorfor every kNN query. Our also pays special attention ensuring scalability in addition low query times. Overall, can be just tens milliseconds (as opposed (tens of) seconds required art. We implemented usinga NoSQL DB (HBase) compare against state-of-the-art: Hadoop-based Spatial Hadoop (SHadoop) Spark-based Simba employ different various sizes, showcasing contributed advantages. outperforms stateof art, 2-3 orders magnitude, consistently dataset sizes ranging hundreds millions billions points. show key constituent overheads incurred during network bandwidth, time coordinator) scale well, overall approach.

参考文章(28)
Simin You, Jianting Zhang, Le Gruenwald, Large-scale spatial join query processing in Cloud international conference on data engineering. pp. 34- 41 ,(2015) , 10.1109/ICDEW.2015.7129541
Jon L. Bentley, Donald F. Stanat, E.Hollins Williams, The complexity of finding fixed-radius near neighbors☆ Information Processing Letters. ,vol. 6, pp. 209- 212 ,(1977) , 10.1016/0020-0190(77)90070-9
Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica, Discretized streams: fault-tolerant streaming computation at scale symposium on operating systems principles. pp. 423- 438 ,(2013) , 10.1145/2517349.2522737
Liu Jiang, Bing Li, Meina Song, THE optimization of HDFS based on small files 2010 3rd IEEE International Conference on Broadband Network and Multimedia Technology (IC-BNMT). pp. 912- 915 ,(2010) , 10.1109/ICBNMT.2010.5705223
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, Bigtable ACM Transactions on Computer Systems. ,vol. 26, pp. 1- 26 ,(2008) , 10.1145/1365815.1365816
Reynold S Xin, Joseph E Gonzalez, Michael J Franklin, Ion Stoica, EECS AMPLab, GraphX: a resilient distributed graph system on Spark First International Workshop on Graph Data Management Experiences and Systems. pp. 2- ,(2013) , 10.1145/2484425.2484427
Dan Han, Eleni Stroulia, HGrid: A Data Model for Large Geospatial Data Sets in HBase international conference on cloud computing. pp. 910- 917 ,(2013) , 10.1109/CLOUD.2013.78
R. A. Finkel, J. L. Bentley, Quad trees a data structure for retrieval on composite keys Acta Informatica. ,vol. 4, pp. 1- 9 ,(1974) , 10.1007/BF00288933
Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi, $\mathcal{MD}$-HBase: design and implementation of an elastic data infrastructure for cloud-scale location services Distributed and Parallel Databases. ,vol. 31, pp. 289- 319 ,(2013) , 10.1007/S10619-012-7109-Z
Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, Matei Zaharia, None, Spark SQL: Relational Data Processing in Spark international conference on management of data. pp. 1383- 1394 ,(2015) , 10.1145/2723372.2742797