Load-Balancing the Distance Computations in Record Linkage

Authors: Dimitrios Karapiperis, Vassilios S. Verykios

DOI: 10.1145/2830544.2830546

Abstract: In this paper, we propose a novel method for distributing the distance computations of record pairs generated by a blocking mechanism to the reduce tasks of a Map/Reduce system. The solutions proposed in the literature analyze the blocks and then construct a profile, which contains the number of record pairs in each block. However, this deterministic process, including all its variants, might incur considerable overhead given massive data sets. In contrast, our method utilizes two Map/Reduce jobs, where the first job formulates the record pairs while the second job distributes these pairs to the reduce tasks, which perform the distance computations, using repetitive allocation rounds. In each such round, we utilize all the available reduce tasks on a random basis by generating permutations of their indexes. A series of experiments demonstrate an almost-equal distribution of the record pairs, or equivalently of the distance computations, to the reduce tasks, which makes our method a simple, yet efficient, solution for applying a blocking mechanism given massive data sets.
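
The allocation rounds described in the abstract can be sketched roughly as follows. This is a minimal single-process Python illustration, not the authors' Map/Reduce implementation: the function name `allocate_in_rounds`, the `seed` parameter, and the toy record pairs are all assumptions made for the example. In each round, a fresh random permutation of the reduce-task indexes is drawn, and consecutive record pairs are assigned to the tasks in that order until the permutation is used up.

```python
import random
from collections import Counter


def allocate_in_rounds(pairs, num_reducers, seed=None):
    """Assign each record pair to a reduce-task index using allocation rounds.

    Hypothetical helper: every round draws a random permutation of the
    reducer indexes and hands out one pair per index until the
    permutation is exhausted, then a new round starts.
    """
    rng = random.Random(seed)
    assignment = []
    order = []  # remaining reducer indexes in the current round
    for pair in pairs:
        if not order:
            # Start a new round with a fresh random permutation.
            order = list(range(num_reducers))
            rng.shuffle(order)
        assignment.append((pair, order.pop()))
    return assignment


# Toy input: all 45 candidate pairs among 10 records of a single block.
pairs = [(i, j) for i in range(10) for j in range(i + 1, 10)]
assignment = allocate_in_rounds(pairs, num_reducers=4, seed=0)

# Number of distance computations each reduce task would receive.
load = Counter(reducer for _, reducer in assignment)
```

Because every complete round gives each reduce task exactly one pair, the loads in this sketch can differ by at most one pair, and no profile of block sizes has to be built beforehand.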
