Making the Most of It: Word Sense Annotation and Disambiguation in the Face of Data Sparsity and Ambiguity

Author: David Alan Jurgens

Advisor: Michael G. Dyer

Abstract: Natural language is highly ambiguous, with the same word having different meanings depending on context. While human readers often have no trouble interpreting the correct meaning, semantic ambiguity poses a significant problem for many natural language systems, such as those that translate text or perform machine reading. The task of identifying which meaning of a word is present in a given context is known as Word Sense Disambiguation (WSD), where a word's meanings are discretized into units referred to as senses. Because languages contain hundreds of thousands of unique words and each word can have multiple meanings, comprehensive sense-annotated corpora are sparse, with only tens to low hundreds of annotated examples per word. As a result, creating high-performance WSD systems requires overcoming this data sparsity.

This thesis provides a three-fold approach to improving WSD in the face of data sparsity. First, we introduce two new algorithms that take on the role of the lexicographer and automatically learn the senses of a word from example uses in a fully unsupervised way. We then demonstrate that these algorithms can be combined with a limited amount of sense-annotated data to create a semi-supervised WSD system that significantly outperforms a state-of-the-art supervised system trained on the same data. Second, we propose a novel method for gathering high-quality sense annotations from large numbers of untrained, online workers, commonly referred to as crowdsourcing. Our method lowers the time and cost of building sense-annotated corpora, while maintaining a high level of agreement between annotators, comparable to that of experts. Third, we analyze cases of ambiguity in sense annotations, when annotators differ about which sense best describes a particular usage of a word. To perform this analysis, we built the largest sense-annotated corpus in which ambiguity is explicitly marked. Our analysis revealed the causes of ambiguity as well as how ambiguity may be interpreted and resolved by applications using ambiguous annotations. To complement the work on ambiguity, we also introduced a methodology for evaluating WSD systems that report ambiguous instances.
