Indexing Methods for Faster and More Effective Person Name Search.

Author: Mark Arehart

Abstract: This paper compares several indexing methods for person names extracted from text, developed for an information retrieval system with requirements for fast approximate matching of noisy and multicultural Romanized names. Such matching algorithms are computationally expensive and unacceptably slow when used without an indexing or blocking step. The goal is to create a small candidate pool containing all the true matches that can be exhaustively searched by a more effective but slower name comparison method. In addition to dramatically faster search, some of the methods evaluated here led to modest gains in effectiveness by eliminating false positives. Four indexing techniques using either phonetic keys or substrings of name segments, with and without name segment stopword lists, were combined with three name matching algorithms. On a test set of 700 queries run against 70K names, the best-performing technique took just 2.1% as long as the naive exhaustive search and increased F1 by 3 points, showing that an appropriate indexing technique can increase both speed and effectiveness.
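
The two-stage idea in the abstract (a cheap index that builds a small candidate pool, followed by exhaustive comparison within that pool) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the trigram key size, the tiny stopword list, the helper names (segments, keys, build_index, candidates, search), and the use of difflib.SequenceMatcher as a stand-in for the slower name comparison method. A phonetic key function could be swapped in for keys() to mirror the other indexing family the abstract mentions.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical stopword list of very common name segments (illustrative only).
STOPWORDS = {"al", "de", "van", "bin"}


def segments(name, stopwords=STOPWORDS):
    """Split a name into lowercase segments, dropping stopword segments."""
    segs = name.lower().split()
    kept = [s for s in segs if s not in stopwords]
    return kept or segs  # keep everything if all segments were stopwords


def keys(name, n=3):
    """Index keys: all length-n substrings of each retained name segment."""
    out = set()
    for seg in segments(name):
        if len(seg) <= n:
            out.add(seg)  # short segments are indexed whole
        else:
            out.update(seg[i:i + n] for i in range(len(seg) - n + 1))
    return out


def build_index(names):
    """Inverted index mapping each key to the positions of names containing it."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        for k in keys(name):
            index[k].add(i)
    return index


def candidates(query, index):
    """Candidate pool: every indexed name sharing at least one key with the query."""
    pool = set()
    for k in keys(query):
        pool |= index.get(k, set())
    return pool


def slow_score(a, b):
    """Stand-in for a more effective but slower name comparison method."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query, names, index, threshold=0.8):
    """Score only the candidate pool instead of the whole name list."""
    hits = [(slow_score(query, names[i]), names[i])
            for i in candidates(query, index)]
    return sorted((h for h in hits if h[0] >= threshold), reverse=True)
```

In this sketch the expensive comparison runs only on names that share at least one key with the query, which is where the speedup comes from. A pool that is too permissive loses the speedup, and one that is too strict drops true matches, which is the trade-off the paper evaluates across its indexing techniques.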
