Authors: Tomaz Erjavec, Jan Rupnik, Miha Grcar
Keywords: Natural language processing, Speech recognition, Hidden Markov model, Classifier (linguistics), Artificial intelligence, Slovene language, Computer science, Statistical classification, Feature (machine learning), Task (project management), Language technology
Abstract: Part-of-speech (PoS) or, better, morphosyntactic tagging is the process of assigning categories to words in a text, an important pre-processing step for most human language technology applications. PoS-tagging of Slovene texts is a challenging task, since the size of the tagset is over one thousand tags (as opposed to English, where it is typically around sixty) and state-of-the-art accuracy is still below the desired level. The paper describes an experiment aimed at improving tagging accuracy for Slovene by combining the outputs of two taggers: a proprietary rule-based tagger developed by the Amebis HLT company, and TnT, a tri-gram HMM tagger trained on a hand-annotated corpus of Slovene. The two taggers have comparable accuracy, but there are many cases where, if their predictions differ, one of the two does assign the correct tag. We investigate training a classifier on top of the outputs of both taggers that predicts which of the two is correct. We experiment with selecting different classification algorithms and constructing different feature sets, and show that some configurations yield a meta-tagger with a significant increase in accuracy compared to either tagger in isolation.
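
The meta-tagging idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the classification algorithms or feature sets used, so this sketch assumes the simplest possible feature (the pair of proposed tags) and a majority-count lookup in place of a trained classifier. The function names `train_meta_tagger` and `meta_tag`, the `default` fallback, and the toy tags are all hypothetical.

```python
from collections import Counter, defaultdict


def train_meta_tagger(examples):
    """Learn which tagger to trust for each disagreeing pair of tags.

    examples: iterable of (tag_a, tag_b, gold_tag) triples, where tag_a and
    tag_b are the tags proposed by two taggers for the same word.
    Returns a dict mapping (tag_a, tag_b) to "A" or "B".
    """
    votes = defaultdict(Counter)
    for tag_a, tag_b, gold in examples:
        if tag_a == tag_b:
            continue  # nothing to arbitrate when the taggers agree
        if gold == tag_a:
            votes[(tag_a, tag_b)]["A"] += 1
        elif gold == tag_b:
            votes[(tag_a, tag_b)]["B"] += 1
        # if neither tagger is right, the example gives no signal here
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}


def meta_tag(model, tag_a, tag_b, default="A"):
    """Pick one of the two proposed tags; fall back to `default` for unseen pairs."""
    if tag_a == tag_b:
        return tag_a
    choice = model.get((tag_a, tag_b), default)
    return tag_a if choice == "A" else tag_b


# Toy training data (invented for illustration, not taken from any corpus).
train = [
    ("Ncmsn", "Ncmsa", "Ncmsa"),
    ("Ncmsn", "Ncmsa", "Ncmsa"),
    ("Ncmsn", "Ncmsa", "Ncmsn"),
    ("Vmip3s", "Ncfsn", "Vmip3s"),
    ("Agpmsn", "Agpmsn", "Agpmsn"),
]
model = train_meta_tagger(train)
chosen = meta_tag(model, "Ncmsn", "Ncmsa")
```

A realistic meta-tagger would presumably use richer features (for example the word form or neighbouring tags) and a proper learning algorithm; the lookup table above only shows the shape of the problem, which is a binary choice between two taggers made only when they disagree.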