Author: Mark Arehart
DOI:
Keywords:
Abstract: This paper compares several indexing methods for person names extracted from text, developed for an information retrieval system that requires fast approximate matching of noisy and multicultural Romanized names. Such matching algorithms are computationally expensive and unacceptably slow when used without an indexing or blocking step. The goal is to create a small candidate pool, containing all the true matches, that can be exhaustively searched by a more effective but slower name comparison method. In addition to dramatically faster search, some of the methods evaluated here led to modest gains in effectiveness by eliminating false positives. Four indexing techniques, using either phonetic keys or substrings of name segments, with and without name segment stopword lists, were combined with three name matching algorithms. On a test set of 700 queries run against 70K names, the best-performing technique took just 2.1% as long as a naive exhaustive search and increased F1 by 3 points, showing that an appropriate indexing technique can increase both speed and effectiveness.
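The blocking idea summarized in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration rather than the paper's method: the key function (the first two letters of each name segment), the stopword list, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as a stand-in for the slower name comparison algorithm. The function names `segment_keys`, `build_index`, `candidate_ids`, and `search` are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def segment_keys(name, n=2, stopwords=frozenset()):
    """Blocking keys: the first n letters of each non-stopword name segment.

    This substring key is an illustrative assumption, not the paper's scheme.
    """
    return {seg[:n] for seg in name.lower().split() if seg not in stopwords}


def build_index(names, n=2, stopwords=frozenset()):
    """Map each blocking key to the set of positions of names that produce it."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        for key in segment_keys(name, n, stopwords):
            index[key].add(i)
    return index


def candidate_ids(query, index, n=2, stopwords=frozenset()):
    """Candidate pool: every indexed name sharing at least one key with the query."""
    pool = set()
    for key in segment_keys(query, n, stopwords):
        pool |= index.get(key, set())
    return pool


def search(query, names, index, threshold=0.8, n=2, stopwords=frozenset()):
    """Run the slow comparison only on the candidate pool, not on all names."""
    results = []
    for i in sorted(candidate_ids(query, index, n, stopwords)):
        # SequenceMatcher stands in for a slower, more effective name matcher.
        score = SequenceMatcher(None, query.lower(), names[i].lower()).ratio()
        if score >= threshold:
            results.append((names[i], score))
    return results


names = ["Mohammed al Rashid", "Muhammad al Rasheed", "John Smith",
         "Jon Smyth", "Maria Garcia"]
stop = frozenset({"al"})  # hypothetical stopword list of common name segments
index = build_index(names, stopwords=stop)
matches = search("Mohamed al Rashid", names, index, stopwords=stop)
```

In this toy setup the query shares a key only with the first two names, so the slow comparison runs on two candidates instead of five. Names that never enter the pool cannot become false positives, which is one way a blocking step could improve effectiveness as well as speed.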