Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets.

Authors: Jakub Zavrel, Saso Dzeroski, Tomaz Erjavec

DOI:

Keywords:

Abstract: The paper evaluates tagging techniques on a corpus of Slovene, where we are faced with a large number of possible word-class tags and only a small (hand-tagged) dataset. We report on training and testing four different taggers on the Slovene MULTEXT-East corpus, containing about 100,000 words and 1000 morphosyntactic tags. Results show, first of all, that training times of the Maximum Entropy Tagger and the Rule Based Tagger are unacceptably long, while they are negligible for the Memory Based Taggers and the TnT tri-gram tagger. Results on a random split show that tagging accuracy varies between 86% and 89% overall, between 92% and 95% on known words, and between 54% and 55% on unknown words. Best results are obtained by TnT. The paper also investigates performance in relation to our EAGLES-based morphosyntactic tagset. Here we compare the per-feature accuracy on the full tagset with accuracies on these features when training on a reduced tagset. Accuracy on PoS is quite high, while accuracy on Case is lowest. Tagset reduction helps improve accuracy, but less than might be expected.

References (14)
Nancy Ide, Tomaz Erjavec. The MULTEXT-East Corpus. Language Resources and Evaluation, pp. 971-974 (1998).
Patrick Paroubek, Martin Rajman, Gilles Adda, Josette Lecomte, Joseph Mariani. The GRACE French Part-Of-Speech Tagging Evaluation Task. Language Resources and Evaluation, pp. 433-441 (1998).
Dan Tufiş. Tiered Tagging and Combined Language Models Classifiers. Text, Speech and Dialogue, pp. 28-33 (1999). DOI: 10.1007/3-540-48239-3_5
Jakub Zavrel, Peter Berck, Steven Gillis, Walter Daelemans. MBT: A Memory-Based Part of Speech Tagger-Generator. International Conference on Computational Linguistics, pp. 14-27 (1996).
Jan Hajič. Morphological tagging: data vs. dictionaries. North American Chapter of the Association for Computational Linguistics, pp. 94-101 (2000).
Ludmila Dimitrova, Nancy Ide, Vladimir Petkevic, Tomaz Erjavec, Heiki Jaan Kaalep, Dan Tufis. Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages. Meeting of the Association for Computational Linguistics, pp. 315-319 (1998). DOI: 10.3115/980845.980897
Thorsten Brants. TnT - Statistical Part-of-Speech Tagging. Saarland University, Computational Linguistics (2000).
Doug Cutting, Julian Kupiec, Jan Pedersen, Penelope Sibun. A Practical Part-of-Speech Tagger. Conference on Applied Natural Language Processing, pp. 133-140 (1992). DOI: 10.3115/974499.974523
Eric Brill. A Simple Rule-Based Part of Speech Tagger. Conference on Applied Natural Language Processing, pp. 152-155 (1992). DOI: 10.3115/974499.974526