Authors: Patrick Schone, Alexander E. Richman
DOI:
Keywords: Ukrainian, Less Commonly Taught Languages, Entity linking, Named entity, Portuguese, Natural language processing, Computer science, Process (engineering), Named-entity recognition, Artificial intelligence
Abstract: In this paper, we describe a system by which the multilingual characteristics of Wikipedia can be utilized to annotate a large corpus of text with Named Entity Recognition (NER) tags requiring minimal human intervention and no linguistic expertise. This process, though of value in languages for which resources exist, is particularly useful for less commonly taught languages. We show how the Wikipedia format can be used to identify possible named entities and discuss in detail the process by which we use the Category structure inherent to Wikipedia to determine the named entity type of a proposed entity. We further describe the methods by which English language data can be used to bootstrap the NER process in other languages. We demonstrate the system by using the generated corpus as training sets for a variant of BBN's Identifinder in French, Ukrainian, Spanish, Polish, Russian, and Portuguese, achieving overall F-scores as high as 84.7% on independent, human-annotated corpora, comparable to a system trained on up to 40,000 words of human-annotated newswire.