Authors: Jason Baldridge, Grant DeLozier, Loretta London
DOI:
Keywords: Geoparsing, Word (computer architecture), Language model, Entity linking, Computer science, Set (abstract data type), Spatial analysis, Resolution (logic), Information retrieval, Web content
Abstract: Toponym resolution, or grounding names of places to their actual locations, is an important problem in the analysis of both historical corpora and present-day news and web content. Recent approaches have shifted from rule-based spatial minimization methods to machine-learned classifiers that use features of the text surrounding a toponym. Such methods have been shown to be highly effective, but they crucially rely on gazetteers and are unable to handle unknown place names or locations. We address this limitation by modeling the geographic distributions of words over the earth's surface: we calculate the geographic profile of each word based on local spatial statistics over a set of geo-referenced language models. These geo-profiles can be further refined by combining in-domain data with background statistics from Wikipedia. Our resolver computes the overlap of all geo-profiles in a given text span; without using a gazetteer, it performs on par with existing classifiers. When combined with a gazetteer, it achieves state-of-the-art performance on two standard toponym resolution corpora (TR-CoNLL and Civil War). Furthermore, it dramatically improves recall when toponyms are identified by named entity recognizers, which often (correctly) find non-standard variants of toponyms.
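The idea of overlapping per-word geo-profiles can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a hypothetical one-degree latitude/longitude grid, uses plain relative frequencies in place of the local spatial statistics mentioned in the abstract, and combines profiles by simple summation. The names `cell_of`, `build_profiles`, and `resolve` are invented for illustration.

```python
from collections import Counter, defaultdict


def cell_of(lat, lon, size=1.0):
    """Map a coordinate to a grid cell (assumption: a uniform lat/lon grid)."""
    return (int((lat + 90) // size), int((lon + 180) // size))


def build_profiles(docs, size=1.0):
    """Build a toy geo-profile for each word.

    docs: iterable of (tokens, lat, lon) geo-referenced documents.
    Returns {word: {cell: relative frequency of the word in that cell}}.
    Assumption: relative frequency stands in for a local spatial statistic.
    """
    counts = defaultdict(Counter)
    for tokens, lat, lon in docs:
        cell = cell_of(lat, lon, size)
        for tok in tokens:
            counts[tok.lower()][cell] += 1
    profiles = {}
    for word, cell_counts in counts.items():
        total = sum(cell_counts.values())
        profiles[word] = {c: n / total for c, n in cell_counts.items()}
    return profiles


def resolve(span_tokens, profiles):
    """Sum the profiles of all words in the span and return the best cell.

    Returns None when no word in the span has a profile. No gazetteer is used.
    """
    overlap = Counter()
    for tok in span_tokens:
        for cell, weight in profiles.get(tok.lower(), {}).items():
            overlap[cell] += weight
    if not overlap:
        return None
    return max(overlap, key=overlap.get)


# Toy geo-referenced documents (made-up tokens, approximate coordinates).
docs = [
    (["paris", "seine", "louvre"], 48.85, 2.35),
    (["paris", "texas", "rodeo"], 33.66, -95.55),
]
profiles = build_profiles(docs)
best = resolve(["paris", "near", "the", "seine"], profiles)
```

In this toy setup the ambiguous word "paris" is split evenly between two cells, and the context word "seine" tips the summed profile toward the cell containing the French coordinates. A gazetteer, when available, could restrict the candidate cells before taking the maximum; that step is omitted here.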