Entity matching for intelligent information integration

Authors: Hsinchun Chen, Gang Wang

Keywords: Information system, Heuristic, Decision rule, Artificial intelligence, Naive Bayes classifier, Machine learning, Information integration, Computer science, Matching (statistics), Probabilistic logic, Data mining, Feature selection

Abstract: Due to the rapid development of information technologies, especially network technologies, business activities have never been as integrated as they are now. Business decision making often requires gathering information from different sources. This dissertation focuses on the problem of entity matching: associating corresponding information elements within or across information systems. It is devoted to providing complete and accurate information for business decision making. Three challenges are identified that may affect entity matching performance: feature selection for representing an entity, matching techniques, and searching strategy. The dissertation first provides a theoretical foundation for entity matching by connecting it to the similarity and categorization theories developed in the field of cognitive science. The theories provide guidance for tackling the three challenges identified. First, based on the feature contrast model, we propose a case-study-based methodology that identifies key features that uniquely identify an entity. Second, we propose a record comparison technique and a multi-layer naive Bayes model, which correspond respectively to the deterministic and the probability response models defined in the categorization theory. Experiments show that both techniques are effective in linking deceptive criminal identities. However, the probabilistic technique is preferable because it uses a semi-supervised learning method, which requires less human intervention during training. Third, based on the prototype access assumption proposed in the categorization theory, we apply an adaptive detection algorithm to entity matching so that efficiency can be greatly improved by a reduced search space. Experiments show that this technique significantly improves efficiency without significant accuracy loss. Based on the above findings, we developed the Arizona IDMatcher, an identity matching system based on the multi-layer naive Bayes model and the adaptive detection method. We compare the system against the IBM Identity Resolution tool, a leading commercial product developed using heuristic decision rules. The experiments do not suggest a clear winner, but show the pros and cons of each system. The Arizona IDMatcher is able to capture more true matches than IBM Identity Resolution (i.e., high recall). On the other hand, the matches identified by IBM Identity Resolution are mostly true matches (i.e., high precision).
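
The recall and precision contrast at the end of the abstract can be made concrete with a minimal sketch. The function name `precision_recall`, the record identifiers, and the two sets of predicted pairs are hypothetical illustrations. They are not data, code, or results from the dissertation. The sketch assumes that a match is an unordered pair of record identifiers and that a ground-truth set of true matches is available.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for collections of unordered matched pairs."""
    # Treat ("a1", "a2") and ("a2", "a1") as the same match.
    predicted = {frozenset(pair) for pair in predicted}
    truth = {frozenset(pair) for pair in truth}
    true_positives = len(predicted & truth)
    # Precision: share of reported matches that are true matches.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of true matches that were reported.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth: four pairs of records that refer to the same entity.
truth = [("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2")]

# A cautious matcher reports few pairs, all of them correct.
cautious = [("a1", "a2"), ("b1", "b2")]

# A liberal matcher finds every true pair but also reports two false ones.
liberal = truth + [("a1", "b1"), ("c1", "d2")]

print(precision_recall(cautious, truth))
print(precision_recall(liberal, truth))
```

In this toy setup the cautious matcher has perfect precision but misses half of the true matches, while the liberal matcher recovers all true matches at the cost of some false ones. This is the same kind of trade-off the abstract reports between the two systems.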
