Author: David Alan Jurgens
DOI:
Keywords:
Abstract: Author(s): Jurgens, David Alan | Advisor(s): Dyer, Michael G | Abstract: Natural language is highly ambiguous, with the same word having different meanings depending on the context. While human readers often have no trouble interpreting the correct meaning, semantic ambiguity poses a significant problem for many natural language systems, such as those that translate text or perform machine reading. The task of identifying which meaning of a word is present in a given context is known as Word Sense Disambiguation (WSD), where a word's meanings are discretized into units referred to as senses. Because languages contain hundreds of thousands of unique words and each word can have multiple meanings, comprehensive sense-annotated corpora are sparse, with only tens to low hundreds of annotated examples of each word. As a result, creating high-performance WSD systems requires overcoming this data sparsity. This thesis provides a three-fold approach to improving WSD performance in the face of data sparsity. First, we introduce two new algorithms that take the role of a lexicographer and automatically learn the senses of a word from example uses in a fully unsupervised way. We then demonstrate that these unsupervised algorithms can be combined with a limited amount of sense-annotated data to create a semi-supervised WSD system that significantly outperforms a state-of-the-art supervised WSD system trained on the same data. Second, we propose a novel method for gathering high-quality sense annotations from large numbers of untrained, online workers, commonly referred to as crowdsourcing. Our method lowers the time and cost of building sense-annotated corpora, while maintaining a high level of agreement between annotators, comparable to that of experts. Third, we analyze cases of ambiguity in sense annotations, when annotators differ about which sense best describes a particular usage of a word. To perform the analysis, we built the largest sense-annotated corpus in which ambiguity is explicitly marked. Our analysis revealed the causes of ambiguity as well as how ambiguity may be interpreted and resolved by applications using ambiguous annotations. To complement our work on ambiguity, we also introduced a new methodology for evaluating WSD systems that report ambiguous instances.