The SETimes.HR Linguistically Annotated Corpus of Croatian

Authors: Nikola Ljubešić, Željko Agić

Abstract: We present SETimes.HR, the first linguistically annotated corpus of Croatian that is freely available for all purposes. The corpus is built on top of the SETimes parallel corpus of nine Southeast European languages and English. It is manually annotated for lemmas, morphosyntactic tags, named entities and dependency syntax. We couple the corpus with domain-sensitive test sets for Croatian and Serbian to support direct model transfer evaluation between these closely related languages. We build and evaluate statistical models for lemmatization, morphosyntactic tagging, named entity recognition and dependency parsing on top of SETimes.HR and the test sets, providing the state of the art in all the tasks. We make all the resources presented in the paper freely available under a very permissive licensing scheme.
