Efficient Entity Resolution on Heterogeneous Records

Authors: Yiming Lin, Hongzhi Wang, Jianzhong Li, Hong Gao

DOI: 10.1109/TKDE.2019.2898191

Keywords: Data mining, Schema matching, Data integration, Information retrieval, Computer science, Data exchange, Schema (psychology)

Abstract: Entity resolution (ER) is the problem of identifying and merging records that refer to the same real-world entity. In many scenarios, raw records are stored under a heterogeneous environment. Specifically, the schemas of records may differ from each other. To leverage such records better, most existing work assumes that schema matching and data exchange have been done to convert records under different schemas to those under a predefined schema. However, we observe that schema matching would lose information in some cases, which could be useful or even crucial to ER. To leverage sufficient information from heterogeneous sources, in this paper, we address several challenges of ER on heterogeneous records and show that none of the existing similarity metrics or their transformations could be applied to find similar records under heterogeneous settings. Motivated by this, we design a similarity function and propose a novel framework to iteratively find records that refer to the same entity. Regarding efficiency, we build an index to generate candidates and accelerate similarity computation. Evaluations on datasets show the effectiveness and efficiency of our methods.
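To make the idea of index-based candidate generation concrete, the following is a minimal sketch under stated assumptions, not the authors' method. It assumes records are Python dicts with arbitrary attribute names, compares them by the tokens of their values only (so no schema matching is needed), uses a plain inverted index to propose candidate pairs, and scores candidates with Jaccard similarity. The function names, the toy records, and the 0.5 threshold are all hypothetical; the paper designs its own similarity function and index, which are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Collect lowercase value tokens from a record, ignoring attribute names."""
    out = set()
    for value in record.values():
        out.update(str(value).lower().split())
    return out


def build_index(records):
    """Inverted index: map each token to the ids of records containing it."""
    index = defaultdict(set)
    for rid, record in records.items():
        for tok in tokens(record):
            index[tok].add(rid)
    return index


def candidate_pairs(index):
    """Record-id pairs that share at least one token (the only pairs scored)."""
    pairs = set()
    for rids in index.values():
        for a, b in combinations(sorted(rids), 2):
            pairs.add((a, b))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two token sets (a stand-in similarity, assumed here)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(records, threshold=0.5):
    """Score only index-generated candidates and keep those above the threshold."""
    toks = {rid: tokens(rec) for rid, rec in records.items()}
    index = build_index(records)
    return sorted(
        (a, b)
        for a, b in candidate_pairs(index)
        if jaccard(toks[a], toks[b]) >= threshold
    )


# Hypothetical toy records stored under three different schemas.
records = {
    "r1": {"name": "Alice Smith", "city": "Boston"},
    "r2": {"full_name": "Smith Alice", "location": "Boston", "year": 2019},
    "r3": {"title": "Graph Theory"},
}

print(similar_pairs(records))
```

In this sketch the index avoids comparing r3 with the other two records at all, since it shares no token with them; only the pair (r1, r2) is scored.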
