Authors: Jakub Zavrel, Saso Dzeroski, Tomaz Erjavec
DOI:
Keywords:
Abstract: The paper evaluates tagging techniques on a corpus of Slovene, where we are faced with a large number of possible word-class tags and only a small (hand-tagged) dataset. We report on training and testing four different taggers on the Slovene MULTEXT-East corpus, containing about 100,000 words and 1000 morphosyntactic tags. Results show, first of all, that the training times of the Maximum Entropy Tagger and the Rule Based Tagger are unacceptably long, while they are negligible for the Memory Based Taggers and the TnT tri-gram tagger. Results on a random split show that tagging accuracy varies between 86% and 89% overall, between 92% and 95% on known words, and between 54% and 55% on unknown words. Best results are obtained by TnT. The paper also investigates performance in relation to our EAGLES-based morphosyntactic tagset. Here we compare the per-feature accuracy on the full tagset with the accuracies on these features when the tagset is reduced. PoS accuracy is quite high, while accuracy on Case is lowest. Tagset reduction helps improve accuracy, but less than might be expected.