Lemmatization and Morphosyntactic Tagging of Croatian and Serbian

Authors: Nikola Ljubešić, Željko Agić, Danijela Merkler

Abstract: We investigate state-of-the-art statistical models for lemmatization and morphosyntactic tagging of Croatian and Serbian. The models stem from a new manually annotated SETIMES.HR corpus of Croatian, based on the SETimes parallel corpus. We train models on Croatian text and evaluate them on samples of Croatian and Serbian from the SETimes corpus and the two Wikipedias. Lemmatization accuracy for the two languages reaches 97.87% and 96.30%, while full morphosyntactic tagging accuracy using a 600-tag tagset peaks at 87.72% and 85.56%, respectively. Part of speech tagging accuracies reach 97.13% and 96.46%. Results indicate that more complex methods of Croatian-to-Serbian annotation projection are not required on such dataset sizes for these particular tasks. The SETIMES.HR corpus, its resulting models and test sets are all made freely available.

References (21)
Nikola Ljubešić, Tomaž Erjavec. hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene. Text, Speech and Dialogue, pp. 395-402 (2011). 10.1007/978-3-642-23538-2_50
Željko Agić, Danijela Merkler. Three Syntactic Formalisms for Data-Driven Dependency Parsing of Croatian. Text, Speech and Dialogue, vol. 8082, pp. 560-567 (2013). 10.1007/978-3-642-40585-3_70
Hrvoje Peradin, Jan Šnajder. Towards a Constraint Grammar Based Morphological Tagger for Croatian. Text, Speech and Dialogue, pp. 174-182 (2012). 10.1007/978-3-642-32790-2_21
Marko Tadić. Croatian Lemmatization Server. University of Zagreb, Faculty of Humanities and Social Sciences (2014).
Péter Halácsy, András Kornai, Csaba Oravecz. HunPos: An Open Source Trigram Tagger. Meeting of the Association for Computational Linguistics, pp. 209-212 (2007). 10.3115/1557769.1557830
Marko Tadić, Sanja Fulgosi. Building the Croatian Morphological Lexicon. Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages - MorphSlav '03, pp. 41-45 (2003). 10.3115/1613200.1613206
Tomaž Erjavec, Sašo Džeroski. Machine Learning of Morphosyntactic Structure: Lemmatizing Unknown Slovene Words. Applied Artificial Intelligence, vol. 18, pp. 17-41 (2004). 10.1080/08839510490250088
Tomaž Erjavec. MULTEXT-East: Morphosyntactic Resources for Central and Eastern European Languages. Language Resources and Evaluation, vol. 46, pp. 131-142 (2012). 10.1007/S10579-011-9174-8
Drahomíra "johanka" Spoustová, Jan Hajič, Jan Votrubec, Pavel Krbec, Pavel Květoň. The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech. Meeting of the Association for Computational Linguistics, pp. 67-74 (2007). 10.3115/1567545.1567558