Improving Morphosyntactic Tagging of Slovene Language through Meta-tagging

Authors: Tomaz Erjavec, Jan Rupnik, Miha Grcar

DOI: 10.31449/INF.V32I4.215

Keywords: Natural language processing, Speech recognition, Hidden Markov model, Classifier (linguistics), Artificial intelligence, Slovene language, Computer science, Statistical classification, Feature (machine learning), Task (project management), Language technology

Abstract: Part-of-speech (PoS) or, better, morphosyntactic tagging is the process of assigning morphosyntactic categories to words in a text, an important pre-processing step for most human language technology applications. PoS-tagging of Slovene texts is a challenging task, since the size of the tagset is over one thousand tags (as opposed to English, where the size is typically around sixty) and the state-of-the-art tagging accuracy is still below the levels desired. The paper describes an experiment aimed at improving tagging accuracy for Slovene by combining the outputs of two taggers: a proprietary rule-based tagger developed by the Amebis HLT company, and TnT, a tri-gram HMM tagger trained on a hand-annotated corpus of Slovene. The two taggers have comparable accuracy, but there are many cases where, if the predictions of the two taggers differ, one of the two does assign the correct tag. We investigate training a classifier on top of the outputs of both taggers that predicts which of the two taggers is correct. We experiment with selecting different classification algorithms and constructing different feature sets for training, and show that some cases yield a meta-tagger with a significant increase in accuracy compared to that of either tagger in isolation.
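The meta-tagging idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the feature sets or classification algorithms, so the sketch substitutes a simple majority-vote lookup keyed on the pair of disagreeing tags. The training rows, the tag strings, and the function names `train_meta_tagger` and `meta_tag` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (tag from tagger A, tag from tagger B, gold tag).
# The tag strings are illustrative placeholders only.
TRAIN = [
    ("Ncmsn", "Ncmsa", "Ncmsn"),
    ("Ncmsn", "Ncmsa", "Ncmsn"),
    ("Ncmsn", "Ncmsa", "Ncmsa"),
    ("Vmip3s", "Vmmp2s", "Vmmp2s"),
    ("Vmip3s", "Vmmp2s", "Vmmp2s"),
    ("Agpfsn", "Agpfsn", "Agpfsn"),
]


def train_meta_tagger(rows):
    """For each disagreeing (tag_a, tag_b) pair, learn which tagger was right more often."""
    votes = defaultdict(Counter)
    for tag_a, tag_b, gold in rows:
        if tag_a == tag_b:
            continue  # the taggers agree, so there is nothing to decide
        if gold == tag_a:
            votes[(tag_a, tag_b)]["A"] += 1
        elif gold == tag_b:
            votes[(tag_a, tag_b)]["B"] += 1
        # rows where neither tagger is correct carry no signal here
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}


def meta_tag(model, tag_a, tag_b, default="A"):
    """Return the tag of the tagger the model trusts for this disagreement."""
    if tag_a == tag_b:
        return tag_a
    choice = model.get((tag_a, tag_b), default)  # unseen pairs fall back to `default`
    return tag_a if choice == "A" else tag_b


model = train_meta_tagger(TRAIN)
print(meta_tag(model, "Vmip3s", "Vmmp2s"))
```

A trained classifier over richer features (for example the word form or neighbouring tags) would replace the lookup table; the table only shows the shape of the decision, namely choosing between two proposed tags rather than predicting a tag from scratch.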
