Context-based morphological disambiguation with random fields

Authors: Noah A. Smith, David A. Smith, Roy W. Tromble

DOI: 10.3115/1220575.1220635

Keywords: Context (language use), Sentence, Lemmatisation, Artificial intelligence, Arabic, Czech, Morpheme, Word (computer architecture), Morphological dictionary, Computer science, Speech recognition, Natural language processing

Abstract: Finite-state approaches have been highly successful at describing the morphological processes of many languages. Such approaches have largely focused on modeling the phone- or character-level processes that generate candidate lexical types, rather than tokens in context. For the full analysis of words in context, disambiguation is also required (Hakkani-Tur et al., 2000; Hajic et al., 2001). In this paper, we apply a novel source-channel model to the problem of morphological disambiguation (segmentation into morphemes, lemmatization, and POS tagging) for concatenative, templatic, and inflectional languages. The channel model exploits an existing morphological dictionary, constraining each word's analysis to be linguistically valid. The source model is a factored, conditionally-estimated random field (Lafferty et al., 2001) that learns to disambiguate the full sentence by modeling local contexts. Compared with baseline state-of-the-art methods, our method achieves statistically significant error rate reductions on Korean, Arabic, and Czech, for various training set sizes and accuracy measures.
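
The division of labor described in the abstract (a dictionary that restricts each word to valid analyses, plus a sentence-level model that scores local contexts) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy English lexicon, the hand-set weights, and the names `LEXICON`, `WEIGHTS`, `local_score`, and `disambiguate` are hypothetical and do not come from the paper. The sketch uses fixed tag-bigram scores with Viterbi search, which stands in for, and is much simpler than, the conditionally-estimated random field the authors train.

```python
# Hypothetical toy dictionary: each surface word maps to the only
# (lemma, tag) analyses it is allowed to receive.
LEXICON = {
    "the": [("the", "DET")],
    "can": [("can", "NOUN"), ("can", "VERB"), ("can", "AUX")],
    "rusts": [("rust", "VERB"), ("rust", "NOUN")],
}

# Hypothetical hand-set weights over adjacent tag pairs. In a real
# random field these would be learned from annotated data.
WEIGHTS = {
    ("<s>", "DET"): 1.0,
    ("DET", "NOUN"): 2.0,
    ("DET", "VERB"): -1.0,
    ("DET", "AUX"): -1.0,
    ("NOUN", "VERB"): 1.5,
    ("NOUN", "NOUN"): 0.2,
    ("AUX", "VERB"): 1.0,
    ("VERB", "NOUN"): 0.5,
}


def local_score(prev_tag, tag):
    """Score one local context (a pair of adjacent tags); unseen pairs get 0."""
    return WEIGHTS.get((prev_tag, tag), 0.0)


def disambiguate(words):
    """Pick the highest-scoring sequence of dictionary-licensed analyses.

    Viterbi search: for every candidate analysis of the current word,
    keep only the best-scoring path that ends in that analysis.
    """
    # Maps the last analysis on a path to (path score, path so far).
    best = {("", "<s>"): (0.0, [])}
    for word in words:
        new_best = {}
        # The dictionary acts as a hard constraint on the candidates.
        # Out-of-vocabulary words raise KeyError in this toy version.
        for cand in LEXICON[word]:
            score, path = max(
                (
                    (prev_score + local_score(prev[1], cand[1]), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


print(disambiguate(["the", "can", "rusts"]))
```

With these weights, the ambiguous word "can" is resolved to a noun because the preceding determiner makes that reading score highest, and "rusts" is then resolved to a verb.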

References (26)
Kenneth R. Beesley, Lauri Karttunen. Finite State Morphology. (2003)
Ezra Daya, Shuly Wintner, Dan Roth. Learning Hebrew Roots: Machine Learning with Linguistic Constraints. Empirical Methods in Natural Language Processing, pp. 357-364 (2004)
Noah A. Smith, David A. Smith. Bilingual Parsing with Factored Estimation: Using English to Parse Korean. Empirical Methods in Natural Language Processing, pp. 49-56 (2004)
Jaroslava Hlaváčová. Morphological Guesser of Czech Words. Text, Speech and Dialogue, pp. 70-75 (2001). DOI: 10.1007/3-540-44805-5_9
Kaoru Yamamoto, Yuji Matsumoto, Taku Kudo. Applying Conditional Random Fields to Japanese Morphological Analysis. Empirical Methods in Natural Language Processing, pp. 230-237 (2004)
Jong-Hyeok Lee, Geunbae Lee, Jeongwon Cha. Generalized Unknown Morpheme Guessing for Hybrid POS Tagging of Korean. Meeting of the Association for Computational Linguistics (1998)
Uzzi Ornan, Alon Itai, Moshe Levinger. Learning Morpho-Lexical Probabilities from an Untagged Corpus with an Application to Hebrew. Computational Linguistics, vol. 21, pp. 383-404 (1995)
Chung-hye Han, Martha Palmer, Na-Rae Han, Heejong Yi, Eon-Suk Ko. Penn Korean Treebank: Development and Evaluation. Pacific Asia Conference on Language, Information and Computation, pp. 69-78 (2002)