Authors: Atoshum Cahsai, Nikos Ntarmos, Christos Anagnostopoulos, Peter Triantafillou
Keywords:
Abstract: Recently, parallel/distributed processing approaches have been proposed for k-Nearest Neighbours (kNN) queries over very large (multidimensional) datasets, aiming to ensure scalability. However, this is typically achieved at the expense of efficiency. With this paper we offer a novel approach that alleviates the performance problems associated with state-of-the-art methods. The essence of our approach, which differentiates it from related research, rests on (i) adopting a coordinator-based algorithm, instead of those employed over data-parallel execution engines (such as Hadoop/MapReduce or Spark), and (ii) a way to organize data, structure computation, and index the stored data that ensures that only a small number of data items are retrieved from the underlying store, communicated over the network, and processed by the coordinator for every kNN query. Our approach also pays special attention to ensuring scalability in addition to low query times. Overall, kNN queries can be processed in just tens of milliseconds (as opposed to the (tens of) seconds required by the state of the art). We implemented our approach using a NoSQL DB (HBase) and compare it against the state of the art: the Hadoop-based Spatial Hadoop (SHadoop) and the Spark-based Simba. We employ different datasets of various sizes, showcasing the contributed advantages. Our approach outperforms the state of the art by 2-3 orders of magnitude, consistently for dataset sizes ranging from hundreds of millions to billions of points. We also show that the key constituent overheads incurred during query processing (such as network bandwidth and processing time at the coordinator) scale well, ensuring the scalability of the overall approach.