Toward concept-based text understanding and mining

Authors: Dan Roth, Xin Li

Abstract: There is a huge amount of text information in the world, written in natural languages. Understanding and effectively utilizing this information requires the ability to disambiguate text fragments at several levels, syntactically and semantically, abstracting away details and using background knowledge in a variety of ways. One promising direction toward understanding text in the real semantic sense, which is natural to human beings, and toward supporting intelligent access to textual information, is to implement concept-based text understanding and mining. That is, a mechanism for organizing, indexing, and accessing textual information and discovering knowledge, centered around real-world concepts and entities. Unfortunately, due to the difficulty caused by language ambiguity, most current text-related techniques still deal directly with individual syntactic mentions of concepts, without considering a concept as a whole. A critical problem with these techniques is their lack of capability to resolve the ambiguity of concepts in text. A given entity (representing a person, a location, or an organization) may be mentioned in text in multiple, ambiguous ways. Resolving conceptual ambiguity through Entity Reference Identification (in particular, identifying entities from their mentions, within and across documents, and mapping the mentions to them) offers the hope of discovering and organizing knowledge about the identified entities. This thesis systematically studies this fundamental problem on the way towards concept-based text understanding and mining. We develop machine learning techniques to address different aspects of it, including (1) a discriminative approach to learning similarity metrics that capture the appearance similarity between names; (2) a new supervised clustering framework that can partition a set of names with some global optimization and incorporate similarity metric learning into clustering, guided by supervision; (3) a generative probabilistic model, at the heart of which is a view on how documents are generated and how names (of different entity types) are "sprinkled" into them. We show that all approaches perform very accurately, in the range of 90% to 95% F1 measure for different entity types, better than baselines and previous approaches to (some aspects of) the problem. Our work also shows that, when more domain-specific knowledge is discovered and incorporated into entity identification, and better learning techniques are developed accordingly, we can achieve better performance.
In addition, we extend the model to a significant application related to concept-based text mining: semantic integration across text and databases, based on entity identification and tracking.
