Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene

Authors: Nikola Ljubešić, Tomaž Erjavec

Keywords: Slavic languages, Natural language processing, Computer science, Lexicon, Artificial intelligence

Abstract: In this paper we present a tagger developed for inflectionally rich languages for which both a training corpus and a lexicon are available. We do not constrain the tagger by the lexicon entries, allowing for both lexicon incompleteness and noisiness. By using the lexicon indirectly through features, we allow known and unknown words to be tagged in the same manner. We test our tagger on Slovene data, obtaining a 25% error reduction over the best previous results on both known and unknown words. Given that Slovene is, in comparison to some other Slavic languages, a well-resourced language, we perform experiments on the impact of token (corpus) vs. type (lexicon) supervision, obtaining useful insights into how to balance the effort of extending resources to yield better tagging results.
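
The idea of using the lexicon "indirectly through features" can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `token_features`, the feature names, and the toy lexicon with its tag strings are all assumptions made for illustration. The point it shows is that lexicon tags enter only as extra features, so a word missing from the lexicon is handled by the same code path and simply receives no `lex=` features.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon mapping a lowercased word form to its possible tags.
# The entries and tag strings are illustrative only, not taken from the paper.
LEXICON: Dict[str, Set[str]] = {
    "miza": {"Ncfsn"},
    "je": {"Va-r3s-n", "Pp3fsg--y"},
}


def token_features(tokens: List[str], i: int,
                   lexicon: Dict[str, Set[str]]) -> Dict[str, float]:
    """Build a feature dict for tokens[i] (assumed feature set, for illustration)."""
    lower = tokens[i].lower()
    feats: Dict[str, float] = {"bias": 1.0, "form=" + lower: 1.0}

    # Suffix features are available for every word, known or unknown.
    for n in range(1, 5):
        if len(lower) >= n:
            feats["suffix%d=%s" % (n, lower[-n:])] = 1.0

    # Lexicon tags are added as features, not as hard constraints on the output.
    # A word absent from the lexicon just contributes no "lex=" features.
    for tag in sorted(lexicon.get(lower, ())):
        feats["lex=" + tag] = 1.0

    return feats


if __name__ == "__main__":
    sentence = ["Miza", "je", "velika"]
    for idx in range(len(sentence)):
        print(sentence[idx], sorted(token_features(sentence, idx, LEXICON)))
```

Because the lexicon tags are soft evidence, a downstream classifier trained on such features remains free to predict a tag the lexicon does not list, which is how an incomplete or noisy lexicon can be tolerated.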
