Authors: Dan Roth, Xin Li
DOI:
Keywords:
Abstract: There is a huge amount of text information in the world, written in natural languages. Understanding and effectively utilizing it requires the ability to disambiguate text fragments at several levels, syntactically and semantically, abstracting away details and using background knowledge in a variety of ways. One promising direction toward understanding text in the real semantic sense, which is natural to human beings, and toward supporting intelligent access to textual information, is to implement concept-based text mining. That is, a mechanism for organizing, indexing, and accessing textual information and discovering knowledge, centered around real-world concepts and entities. Unfortunately, due to the difficulty caused by language ambiguity, most current text-related techniques still deal directly with syntactic text and individual mentions of concepts, without considering the concept as a whole. A critical problem of these techniques is the lack of capability to resolve conceptual ambiguity in text. A given entity (representing a person, a location, or an organization) may be mentioned in text in multiple, ambiguous ways. Concept-based text mining can be supported by resolving conceptual ambiguity through Entity Reference Identification: in particular, identifying entities from their mentions, within and across documents, and mapping the mentions to them, in the hope of discovering and organizing knowledge around the identified entities. This thesis systematically studies this fundamental problem toward concept-based text mining. We develop machine learning techniques to address different aspects of it, including (1) a discriminative approach to learning similarity metrics that capture the appearance similarity between names; (2) a new supervised clustering framework, which can partition a set of names with some global optimization and incorporate metric learning into clustering, guided by supervision; (3) a generative probabilistic model, at the heart of which is a view on how documents are generated and how names (of different entity types) are "sprinkled" into them. We show that all the approaches perform very accurately, in the range of 90%-95% F1 measure for different entity types, better than baselines and previous approaches to (some aspects of) the problem. Our work also shows that, when more domain-specific knowledge is discovered and incorporated into identification, and learning techniques are developed accordingly, better performance can be achieved.
In addition, we extend the model to a significant application related to concept-based text mining: semantic integration across text and databases, based on entity identification and tracking.