DerivBase.hr: A High-Coverage Derivational Morphology Resource for Croatian

Author: Jan Šnajder

DOI:

Keywords:

Abstract: Knowledge about derivational morphology has been proven useful for a number of natural language processing (NLP) tasks. We describe the construction and evaluation of DerivBase.hr, a large-coverage derivational morphology resource for Croatian. DerivBase.hr groups 100k lemmas from the web corpus hrWaC into 56k clusters of derivationally related lemmas, so-called derivational families. We focus on suffixal derivation between and within nouns, verbs, and adjectives. We propose two approaches: an unsupervised approach and a knowledge-based approach based on a hand-crafted morphology model but without using any additional lexico-semantic resources. The resource acquisition procedure consists of three steps: corpus preprocessing, acquisition of an inflectional lexicon, and the induction of derivational families. We describe an evaluation methodology based on manually constructed derivational families, from which we sample and annotate pairs of lemmas. We evaluate DerivBase.hr on the so-obtained sample, and show that the knowledge-based version attains good clustering quality of 81.2% precision, 76.5% recall, and 78.8% F1-score. As with similar resources for other languages, we expect DerivBase.hr to be useful for a number of NLP tasks.
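The evaluation described in the abstract scores induced families against manually constructed ones over sampled lemma pairs. A minimal sketch of one common way to do this is given below in Python. The abstract does not give the paper's exact sampling and scoring procedure, so this is an assumed generic pairwise scheme. The function names and the toy lemmas are hypothetical and are not taken from DerivBase.hr.

```python
from itertools import combinations


def family_of(clusters):
    """Map each lemma to the index of the cluster that contains it."""
    return {lemma: i for i, cluster in enumerate(clusters) for lemma in cluster}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def pairwise_scores(induced, gold, pairs):
    """Score induced clusters against gold families on the given lemma pairs.

    A pair is a true positive when both clusterings put the two lemmas in
    the same family, a false positive when only the induced clustering
    does, and a false negative when only the gold clustering does.
    """
    ind, gld = family_of(induced), family_of(gold)
    tp = fp = fn = 0
    for a, b in pairs:
        same_induced = ind[a] == ind[b]
        same_gold = gld[a] == gld[b]
        if same_induced and same_gold:
            tp += 1
        elif same_induced:
            fp += 1
        elif same_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical toy data: gold families versus an induced clustering that
# wrongly splits "pismo" away from the "pisati" family.
gold = [{"pisati", "pisac", "pismo"}, {"čitati", "čitatelj"}]
induced = [{"pisati", "pisac"}, {"pismo"}, {"čitati", "čitatelj"}]

# Here every lemma pair is scored; a real evaluation would sample pairs.
lemmas = sorted(set().union(*gold))
pairs = list(combinations(lemmas, 2))

p, r, f1 = pairwise_scores(induced, gold, pairs)

# Consistency check of the figures reported in the abstract.
reported_f1 = f1_score(0.812, 0.765)
```

On the toy data the induced clustering makes no wrong merges but misses two of the four gold pairs, so precision is 1.0 and recall is 0.5. The last line checks that the reported precision and recall are consistent with the reported F1-score: 2 × 0.812 × 0.765 / (0.812 + 0.765) ≈ 0.788.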

References (27)
Nikola Ljubešić, Tomaž Erjavec. hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene. Text, Speech and Dialogue, pp. 395-402 (2011). DOI: 10.1007/978-3-642-23538-2_50
Eric Gaussier. Unsupervised Learning of Derivational Morphology from Inflectional Lexicons. Unsupervised Learning in Natural Language Processing (1999).
Benoît Sagot. Automatic Acquisition of a Slovak Lexicon from a Raw Corpus. Text, Speech and Dialogue, pp. 156-163 (2005). DOI: 10.1007/11551874_20
Radoslaw Ramocki, Maciej Piasecki, Marek Maziarz. Recognition of Polish Derivational Relations Based on Supervised Learning Scheme. Language Resources and Evaluation, pp. 916-922 (2012).
Péter Halácsy, András Kornai, Csaba Oravecz. HunPos: An Open Source Trigram Tagger. Meeting of the Association for Computational Linguistics, pp. 209-212 (2007). DOI: 10.3115/1557769.1557830
Karel Pala, Dana Hlaváčková. Derivational Relations in Czech WordNet. Meeting of the Association for Computational Linguistics, pp. 75-81 (2007). DOI: 10.3115/1567545.1567559
Marko Tadić, Sanja Fulgosi. Building the Croatian Morphological Lexicon. Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages - MorphSlav '03, pp. 41-45 (2003). DOI: 10.3115/1613200.1613206
J. Šnajder, B. Dalbelo Bašić, M. Tadić. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation. Information Processing & Management, vol. 44, pp. 1720-1731 (2008). DOI: 10.1016/J.IPM.2008.03.006
Enrique Amigó, Julio Gonzalo, Javier Artiles, Felisa Verdejo. A Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints. Information Retrieval, vol. 12, pp. 461-486 (2009). DOI: 10.1007/S10791-008-9066-8