Authors: Hsinchun Chen, Gang Wang
DOI:
Keywords: Information system, Heuristic, Decision rule, Artificial intelligence, Naive Bayes classifier, Machine learning, Information integration, Computer science, Matching (statistics), Probabilistic logic, Data mining, Feature selection
Abstract: Due to the rapid development of information technologies, especially network technologies, business activities have never been as integrated as they are now. Business decision making often requires gathering information from different sources. This dissertation focuses on the problem of entity matching: associating corresponding information elements within or across information systems. It is devoted to providing complete and accurate information for business decision making. Three challenges are identified that may affect entity matching performance: feature selection for representing entities, matching techniques, and searching strategy. The dissertation first provides a theoretical foundation for entity matching by connecting it to the similarity and categorization theories developed in the field of cognitive science. The theories provide guidance for tackling the three challenges identified. First, based on the feature contrast model, we propose a case-study-based methodology that identifies the key features that uniquely identify an entity. Second, we propose a record comparison technique and a multi-layer naive Bayes model, which correspond respectively to the deterministic and probability response models defined in categorization theory. Experiments show that both techniques are effective in linking deceptive criminal identities. However, the probabilistic technique is preferable because it uses a semi-supervised learning method, which requires less human intervention during training. Third, based on the prototype access assumption proposed in categorization theory, we apply an adaptive detection algorithm to entity matching so that efficiency can be greatly improved by a reduced search space. Experiments show that this technique significantly improves matching efficiency without significant accuracy loss. Based on the above findings, we developed the Arizona IDMatcher, an identity matching system based on the multi-layer naive Bayes model and the adaptive detection method. We compare the proposed system against the IBM Identity Resolution tool, a leading commercial product developed using heuristic decision rules. Experiments do not suggest a clear winner, but show the pros and cons of each system. The Arizona IDMatcher is able to capture more true matches than IBM Identity Resolution (i.e., high recall). On the other hand, the matches identified by IBM Identity Resolution are mostly true matches (i.e., high precision).
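Note: The recall versus precision trade-off described in the abstract can be illustrated with a minimal Python sketch. It assumes that a matcher's output is a set of record-ID pairs and that a labeled set of true matches is available. The function name `precision_recall`, the record IDs, and the toy data are hypothetical and are not taken from the dissertation or from either system.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Return (precision, recall) for two collections of matched record pairs."""
    # Treat (a, b) and (b, a) as the same match by using unordered pairs.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    true_positives = len(predicted & truth)
    # Precision: share of reported matches that are true matches.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of true matches that were reported.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data, not results from the dissertation.
true_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")]

# A matcher that reports many pairs: finds more true matches, with more false alarms.
high_recall_system = [("r1", "r2"), ("r4", "r3"), ("r5", "r6"), ("r1", "r9"), ("r2", "r7")]

# A matcher that reports few pairs: misses matches, but what it reports is correct.
high_precision_system = [("r1", "r2"), ("r3", "r4")]

print(precision_recall(high_recall_system, true_pairs))
print(precision_recall(high_precision_system, true_pairs))
```

In this toy setting the first matcher recovers three of the four true matches at the cost of two false pairs, while the second reports only correct pairs but misses half of the true matches, which mirrors the kind of trade-off the abstract reports between the two systems.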